Semiautomated Relay Method and Apparatus

ABSTRACT

A system and method for presenting substantially simultaneous voice and text to an assisted user (AU) during a voice conversation between the AU and a hearing user (HU), the hearing user using an HU device to talk to the assisted user, the system comprising an AU captioned device including a device processor, a relay that includes a relay display, a relay speaker and a relay processor, wherein, at least one of the device processor and the relay processor is programmed to perform the steps of receiving an HU voice signal comprising a sequence of HU voice segments and assigning time stamps to each of the HU voice segments, wherein, the relay processor is programmed to perform the steps of generating text segments corresponding to each HU voice segment, storing each HU voice segment along with a corresponding text segment and a corresponding time stamp in a memory device, broadcasting the HU voice segments to a call assistant (CA) via the relay speaker and presenting each text segment via the relay display substantially contemporaneously with broadcast of the corresponding HU voice segment.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. patent application Ser. No. 15/171,720, filed on Jun. 2, 2017, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 14/953,631, filed on Nov. 30, 2015, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 14/632,257, filed on Feb. 26, 2015, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which claims priority to U.S. provisional patent application Ser. No. 61/946,072, filed on Feb. 28, 2014, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, each of which is incorporated herein in its entirety by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.

BACKGROUND OF THE DISCLOSURE

The present invention relates to relay systems for providing voice-to-text captioning for hearing impaired users and, more specifically, to a relay system that uses automated voice-to-text captioning software to transcribe voice to text.

Many people have at least some degree of hearing loss. For instance, in the United States, about 3 out of every 1,000 people are functionally deaf and about 17 percent (36 million) of American adults report some degree of hearing loss, which typically gets worse as people age. Many people with hearing loss have developed ways to cope with the ways their loss affects their ability to communicate. For instance, many deaf people have learned to use their sight to compensate for hearing loss by either communicating via sign language or by reading another person's lips as they speak.

When it comes to communicating remotely using a telephone, unfortunately, there is no way for a hearing impaired person (e.g., an assisted user (AU)) to use sight to compensate for hearing loss, as conventional telephones do not enable an assisted user to see the person on the other end of the line (e.g., no lip reading or sign viewing). Persons with only partial hearing impairment may simply turn up the volume on their telephones to try to compensate for their loss and can make do in most cases. For others with more severe hearing loss, conventional telephones cannot compensate for their loss and telephone communication is a poor option.

An industry has evolved for providing communication services to assisted users whereby voice communications from a person linked to an assisted user's communication device are transcribed into text and displayed on an electronic display screen for the assisted user to read during a communication session. In many cases the assisted user's device will also broadcast the linked person's voice substantially simultaneously as the text is displayed so that an assisted user who has some ability to hear can use that hearing sense to discern most phrases and can refer to the text when some part of a communication is not understandable from what was heard.

U.S. Pat. No. 6,603,835 (hereinafter “the '835 patent”), titled “System For Text Assisted Telephony,” teaches several different types of relay systems for providing text captioning services to assisted users. One captioning service type is referred to as a single line system, where a relay is linked between an AU's device and a telephone used by the person communicating with the AU. Hereinafter, unless indicated otherwise, the other person communicating with the assisted user will be referred to as a hearing user (HU), even though the AU may in fact be communicating with another assisted user. In single line systems, one line links an HU device to the relay and one line (e.g., the single line) links the relay to the AU device. Voice from the HU is presented to a relay call assistant (CA) who transcribes the voice to text and then the text is transmitted to the AU device to be displayed. The HU's voice is also, in at least some cases, carried or passed through the relay to the AU device to be broadcast to the AU.

The other captioning service type described in the '835 patent is a two line system. In a two line system a hearing user's telephone is directly linked to an assisted user's device for voice communications between the AU and the HU. When captioning is required, the AU can select a captioning control button on the AU device to link to the relay and provide the HU's voice to the relay on a first line. Again, a relay CA listens to the HU voice message and transcribes the voice message into text, which is transmitted back to the AU device on a second line to be displayed to the AU. One of the primary advantages of the two line system over single line systems is that the AU can add captioning to an on-going call. This is important as many AUs are only partially impaired and may only want captioning when absolutely necessary. The option to not have captioning is also important in cases where an AU device can be used as a normal telephone and where non-assisted users that do not need captioning (e.g., a spouse with good hearing capability living with an AU) may also use the AU device.

With any relay system, the primary factors for determining the value of the system are accuracy, speed and cost to provide the service. Regarding accuracy, text should accurately represent voice messages from hearing users so that an AU reading the text has an accurate understanding of the meaning of the message. Erroneous words convey inaccurate messages and also can cause confusion for an AU reading transcribed text.

Regarding speed, ideally text is presented to an AU simultaneously with the voice message corresponding to the text so that the AU sees text associated with a message as the message is heard. In this regard, text that trails a voice message by several seconds can cause confusion. Current systems present captioned text relatively quickly (e.g., 1-3 seconds after the voice message is broadcast) most of the time. However, at times a CA can fall behind when captioning so that longer delays (e.g., 10-15 seconds) occur.

Regarding cost, existing systems require a unique and highly trained CA for each communication session. In known cases CAs need to be able to speak clearly and need to be able to type quickly and accurately. CA jobs are also relatively high pressure jobs and therefore turnover is relatively high when compared to jobs in many other industries, which further increases the costs associated with operating a relay.

One innovation that has increased captioning speed appreciably, and that has reduced the costs associated with captioning at least somewhat, has been the use of voice-to-text transcription software by relay CAs. In this regard, early relay systems required CAs to type all of the text presented via an AU device. To present text as quickly as possible after broadcast of an associated voice message, highly skilled typists were required. During normal conversations people routinely speak at a rate between 110 and 150 words per minute. During a conversation between an AU and an HU, typically only about half the words voiced have to be transcribed (e.g., the AU typically communicates to the HU during half of a session). This means that to keep up with transcribing the HU's portion of a typical conversation a CA has to be able to type at around 55 to 75 words per minute. For comparison, most professional typists type at around 50 to 80 words per minute and therefore can keep up with a normal conversation for at least some time. Professional typists are, however, relatively expensive. In addition, despite being able to keep up with a conversation most of the time, at other times (e.g., during long conversations or during particularly high speed conversations) even professional typists fall behind transcribing real time text and more substantial delays can occur.

In relay systems that use voice-to-text transcription software trained to a CA's voice, a CA listens to an HU's voice and revoices the HU's voice message to a computer running the trained software. The software, being trained to the CA's voice, transcribes the revoiced message much more quickly than a typist can type text and with only minimal errors. In many respects revoicing techniques for generating text are easier and much faster to learn than high speed typing and therefore training costs and the general costs associated with CAs are reduced appreciably. In addition, because revoicing is much faster than typing in most cases, voice-to-text transcription can be expedited appreciably using revoicing techniques.

At least some prior systems have contemplated further reducing costs associated with relay services by replacing CAs with computers running voice-to-text software to automatically convert HU voice messages to text. In the past there have been several problems with this solution which have resulted in no one implementing a workable system. First, most voice messages (e.g., an HU's voice message) delivered over most telephone lines to a relay are not suitable for direct transcription by voice-to-text software. In this regard, automated transcription software on the market has been tuned to work well with a voice signal that includes a much larger spectrum of frequencies than the range used in typical phone communications. The frequency range of voice signals on phone lines is typically between 300 and 3000 Hz. Thus, automated transcription software does not work well with voice signals delivered over a telephone line and large numbers of errors occur. Accuracy suffers further where noise exists on a telephone line, which is a common occurrence.

Second, most automated transcription software has to be trained to the voice of a speaker to be accurate. When a new HU calls an AU's device, there is no way for a relay to have previously trained software to the HU voice and therefore the software cannot accurately generate text using the HU voice messages.

Third, many automated transcription software packages use context in order to generate text from a voice message. To this end, the words around each word in a voice message can be used by software as context for determining which word has been uttered. To use words around a first word to identify the first word, the words around the first word have to be obtained. For this reason, many automated transcription systems wait to present transcribed text until after subsequent words in a voice message have been transcribed so that context can be used to correct prior words before presentation. Systems that hold off presenting text in order to correct it using subsequent context introduce delay in text presentation, which is inconsistent with the relay system need for real time or close to real time text delivery.

BRIEF SUMMARY OF THE DISCLOSURE

It has been recognized that a hybrid semi-automated system can be provided where, when acceptable accuracy can be achieved using automated transcription software, the system can automatically use the transcription software to transcribe HU voice messages to text and, when accuracy is unacceptable, the system can patch in a human CA to transcribe voice messages to text. Here, it is believed that the number of CAs required at a large relay facility may be reduced appreciably (e.g., 30% or more) where software can accomplish a large portion of transcription to text. In this regard, not only is automated transcription software getting better over time, in at least some cases the software may train to an HU's voice and the vagaries associated with voice messages received over a phone line (e.g., the limited 300 to 3000 Hz range) during a first portion of a call so that during a later portion of the call accuracy is particularly good. Training may occur while and in parallel with a CA manually (e.g., via typing, revoicing, etc.) transcribing voice to text and, once accuracy is at an acceptable threshold level, the system may automatically delink from the CA and use the text generated by the software to drive the AU display device.

It has been recognized that in a relay system there are at least two processors that may be capable of performing automated voice recognition processes and therefore that can handle the automated voice recognition part of a triage process involving a call assistant. To this end, in most cases either a relay processor or an assisted user's device processor may be able to perform the automated transcription portion of a hybrid process. For instance, in some cases an assisted user's device will perform automated transcription in parallel with a relay call assistant generating call assistant generated text, where the relay and the assisted user's device cooperate to provide text and assess when the call assistant should be cut out of a call with the automated text replacing the call assistant generated text.

In other cases where a hearing user's communication device is a computer or includes a processor capable of transcribing voice messages to text, the hearing user's device may generate automated text in parallel with a call assistant generating text, and the hearing user's device and the relay may cooperate to provide text and determine when the call assistant should be cut out of the call.

Regardless of which device is performing automated captioning, the call assistant generated text may be used to assess accuracy of the automated text for the purpose of determining when the call assistant should be cut out of the call. In addition, regardless of which device is performing automated text captioning, the call assistant generated text may be used to train the automated voice-to-text software or engine on the fly to expedite the process of increasing accuracy until the call assistant can be cut out of the call.

It has also been recognized that there are times when a hearing impaired person is listening to a hearing user's voice without an assisted user's device providing simultaneous text, and the hearing impaired person becomes confused and would like transcription of recent voice messages from the hearing user. For instance, an assisted user may use an assisted user's device to carry on a non-captioned call and have difficulty understanding a voice message, so that the assisted user initiates a captioning service to obtain text for subsequent voice messages. Here, while text is provided for subsequent messages, the assisted user still cannot obtain an understanding of the voice message that prompted initiation of captioning. As another instance, where call assistant generated text lags appreciably behind a current hearing user's voice message, an assisted user may request that the captioning catch up to the current message.

To provide captioning of recent voice messages in these cases, in at least some embodiments of this disclosure an assisted user's device stores a hearing user's voice messages and, when captioning is initiated or a catch-up request is received, the recorded voice messages are used to either automatically generate text or to have a call assistant generate text corresponding to the recorded voice messages.

In at least some cases when automated software is trained to a hearing user's voice, a voice model for the hearing user that can be used subsequently to tune automated software to transcribe the hearing user's voice may be stored along with a voice profile for the hearing user that can be used to distinguish the hearing user's voice from other hearing users' voices. Thereafter, when the hearing user calls an assisted user's device again, the profile can be used to identify the hearing user and the voice model can be used to tune the software so that the automated software can immediately start generating highly accurate, or at least relatively more accurate, text corresponding to the hearing user's voice messages.
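By way of illustration only, the profile and model reuse described above might be organized along the lines of the following sketch. The cosine-similarity matching, the 0.9 match threshold and all class and function names are assumptions introduced for the example; the disclosure does not prescribe a particular data structure or matching technique.

```python
# Minimal sketch of storing a voice profile (used to recognize a returning
# hearing user) together with a trained voice model (used to pre-tune the AVR
# engine). The matching approach, threshold and names are illustrative only.
import math
from typing import Dict, List, Optional, Tuple

ProfileVector = List[float]      # e.g., averaged acoustic features for a speaker


def _cosine(a: ProfileVector, b: ProfileVector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class HearingUserModelStore:
    def __init__(self, match_threshold: float = 0.9) -> None:
        self._store: Dict[str, Tuple[ProfileVector, bytes]] = {}
        self._threshold = match_threshold

    def save(self, hu_id: str, profile: ProfileVector, model: bytes) -> None:
        """Persist the profile and trained model at the end of a call."""
        self._store[hu_id] = (profile, model)

    def find_model(self, observed: ProfileVector) -> Optional[bytes]:
        """Return the stored model whose profile best matches the observed voice, if any."""
        best_id, best_score = None, 0.0
        for hu_id, (profile, _model) in self._store.items():
            score = _cosine(profile, observed)
            if score > best_score:
                best_id, best_score = hu_id, score
        if best_id is not None and best_score >= self._threshold:
            return self._store[best_id][1]
        return None
```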

To the accomplishment of the foregoing and related ends, the disclosure, then, comprises the features hereinafter fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the disclosure. However, these aspects are indicative of but a few of the various ways in which the principles of the invention can be employed. Other aspects, advantages and novel features of the disclosure will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a schematic showing various components of a communication system including a relay that may be used to perform various processes and methods according to at least some aspects of the present invention;

FIG. 2 is a schematic of the relay server shown in FIG. 1;

FIG. 3 is a flow chart showing a process whereby an automated voice-to-text engine is used to generate automated text in parallel with a call assistant generating text, where the automated text is used instead of the call assistant generated text to provide captioning to an assisted user's device once an accuracy threshold has been exceeded;

FIG. 4 is a sub-process that may be substituted for a portion of the process shown in FIG. 3 whereby a call assistant can determine whether or not the automated text takes over the process after the accuracy threshold has been achieved;

FIG. 5 is a sub-process that may be added to the process shown in FIG. 3 wherein, upon an assisted user requesting help, a call is linked to a second call assistant for correcting the automated text;

FIG. 6 is a process whereby an automated voice-to-text engine is used to fill in text for a hearing user's voice messages that are skipped over by a call assistant when an assisted user requests instantaneous captioning of a current message;

FIG. 7 is a process whereby automated text is automatically used to fill in captioning when transcription by a call assistant lags behind a hearing user's voice messages by a threshold duration;

FIG. 8 is a flow chart illustrating a process whereby text is generated for a hearing user's voice messages that precede a request for captioning services;

FIG. 9 is a flow chart illustrating a process whereby voice messages prior to a request for captioning service are automatically transcribed to text by an automated voice-to-text engine;

FIG. 10 is a flow chart illustrating a process whereby an assisted user's device processor performs transcription processes until a request for captioning is received, at which point the assisted user's device presents text related to hearing user voice messages prior to the request and ongoing voice messages are transcribed via a relay;

FIG. 11 is a flow chart illustrating a process whereby an assisted user's device processor generates automated text for a hearing user's voice messages, which is presented via a display to an assisted user and also transmitted to a call assistant at a relay for correction purposes;

FIG. 12 is a flow chart illustrating a process whereby high definition digital voice messages and analog voice messages are handled differently at a relay;

FIG. 13 is a process similar to FIG. 12, albeit where an assisted user also has the option to link to a call assistant for captioning service regardless of the type of voice message received;

FIG. 14 is a flow chart that may be substituted for a portion of the process shown in FIG. 3 whereby voice models and voice profiles are generated for frequent hearing users that communicate with an assisted user, where the models and profiles can be subsequently used to increase accuracy of a transcription process;

FIG. 15 is a flow chart illustrating a process similar to the sub-process shown in FIG. 14 where voice profiles and voice models are generated and stored for subsequent use during transcription;

FIG. 16 is a flow chart illustrating a sub-process that may be added to the process shown in FIG. 15 where the resulting process calls for training of a voice model at each of an assisted user's device and a relay;

FIG. 17 is a schematic illustrating a screen shot that may be presented via an assisted user's device display screen;

FIG. 18 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 19 is a process that may be performed by the system shown in FIG. 1 where automated text is generated for line check words and is presented to an assisted user immediately upon identification of the words;

FIG. 20 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 21 is a flow chart illustrating a method whereby an automated voice-to-text engine is used to identify errors in call assistant generated text which can be highlighted and corrected by a call assistant;

FIG. 22 is an exemplary AU device display screen shot that illustrates visually distinct text to indicate non-textual characteristics of an HU voice signal to an assisted user;

FIG. 23 is an exemplary CA workstation display screen shot that shows how automated AVR text associated with an instantaneously broadcast word may be visually distinguished for an error correcting CA;

FIG. 24 shows an exemplary HU communication device with CA captioned HU text and AVR generated AU text presented, as well as other communication information, that is consistent with at least some aspects of the present disclosure;

FIG. 25 is an exemplary CA workstation display screen shot similar to FIG. 23, albeit where a CA has corrected an error and an HU voice signal playback has been skipped backward as a function of where the correction occurred;

FIG. 26 is a screen shot of an exemplary AU device display that presents CA captioned HU text as well as AVR engine generated AU text;

FIG. 27 is an illustration of an exemplary HU device that shows text corresponding to the HU's voice signal as well as an indication of which word in the text has been most recently presented to an AU;

FIG. 28 is a schematic diagram showing a relay captioning system that is consistent with at least some aspects of the present disclosure;

FIG. 29 is a schematic diagram of a relay system that includes a text transcription quality assessment function that is consistent with at least some aspects of the present disclosure;

FIG. 30 is similar to FIG. 29, albeit showing a different relay system that includes a different quality assessment function;

FIG. 31 is similar to FIG. 29, albeit showing a third relay system that includes a third quality assessment function;

FIG. 32 is a flow chart illustrating a method whereby time stamps are assigned to HU voice segments which are then used to substantially synchronize text and voice presentation;

FIG. 33 is a schematic illustrating a caption relay system that may implement the method illustrated in FIG. 32 as well as other methods described herein;

FIG. 34 is a sub-process that may be substituted for a portion of the FIG. 32 process where an AU device assigns a sequence of time stamps to a sequence of text segments;

FIG. 35 is another flow chart illustrating another method for assigning and using time stamps to synchronize text and HU voice broadcast;

FIG. 36 is a screen shot illustrating a CA interface where a prior word is selected to be rebroadcast;

FIG. 37 is a screen shot similar to FIG. 36, albeit of an AU device display showing an AU selecting a prior broadcast phrase for rebroadcast;

FIG. 38 is another sub-process that may be substituted for a portion of the FIG. 32 method;

FIG. 39 is a screen shot showing a CA interface where various inventive features are shown;

FIG. 40 is a screen shot illustrating another CA interface where low and high confidence text is presented in different columns to help a CA more easily distinguish between text likely to need correction and text that is less likely to need correction;

FIG. 41 is a flow chart illustrating a method of introducing errors in ASR generated text to test CA attention;

FIG. 42 is a screen shot illustrating an AU interface including, in addition to text presentation, an HU video field and a CA signing field that is consistent with at least some aspects of the present disclosure;

FIG. 43 is a screen shot illustrating yet another CA interface;

FIG. 44 is another AU interface screen shot including scrolling text and an HU video window; and

FIG. 45 is another CA interface screen shot showing a CA correction field, an ASR uncorrected text field and an intervening time field that is consistent with at least some aspects of the present disclosure.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.

DETAILED DESCRIPTION OF THE DISCLOSURE

The various aspects of the subject disclosure are now described with reference to the annexed drawings, wherein like reference numerals correspond to similar elements throughout the several views. It should be understood, however, that the drawings and detailed description hereafter relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.

As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers or processors.

The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

Furthermore, the disclosed subject matter may be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer or processor-based device to implement aspects detailed herein. The term “article of manufacture” (or alternatively, “computer program product”) as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick). Additionally, it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

Referring now to the drawings wherein like reference numerals correspond to similar elements throughout the several views and, more specifically, referring to FIG. 1, the present disclosure will be described in the context of an exemplary communication system 10 including an assisted user's (AU's) communication device 12, a hearing user's (HU's) telephone or other type communication device 14, and a relay 16. The AU's device 12 is linked to the HU's device 14 via any network connection capable of facilitating a voice call between the AU and the HU. For instance, the link may be a conventional telephone line, a network connection such as an internet connection or other network connection, a wireless connection, etc. AU device 12 includes a keyboard 20, a display screen 18 and a handset 22. Keyboard 20 can be used to dial any telephone number to initiate a call and, in at least some cases, includes other keys or may be controlled to present virtual buttons via screen 18 for controlling various functions that will be described in greater detail below. Other identifiers such as IP addresses or the like may also be used in at least some cases to initiate a call. Screen 18 includes a flat panel display screen for displaying, among other things, text transcribed from a voice message or signal generated using HU's device 14, control icons or buttons, caption feedback signals, etc. Handset 22 includes a speaker for broadcasting a hearing user's voice messages to an assisted user and a microphone for receiving a voice message from an assisted user for delivery to the hearing user's device 14. Assisted user device 12 may also include a second loudspeaker so that device 12 can operate as a speakerphone type device. Although not shown, device 12 further includes a processor and a memory for storing software run by the processor to perform various functions that are consistent with at least some aspects of the present disclosure. Device 12 is also linked or is linkable to relay 16 via any communication network including a phone network, a wireless network, the internet or some other similar network, etc.

Hearing user's device 14, in at least some embodiments, includes a communication device (e.g., a telephone) including a keyboard for dialing phone numbers and a handset including a speaker and a microphone for communication with other devices. In other embodiments device 14 may include a computer, a smart phone, a smart tablet, etc., that can facilitate audio communications with other devices. Devices 12 and 14 may use any of several different communication protocols including analog or digital protocols, a VOIP protocol or others.

Referring still to FIG. 1, relay 16 includes, among other things, a relay server 30 and a plurality of call assistant work stations 32, 34, etc. Each of the call assistant work stations 32, 34, etc., is similar and operates in a similar fashion and therefore only station 32 is described here in any detail. Station 32 includes a display screen 50, a keyboard 52 and a headphone/microphone headset 54. Screen 50 may be any type of electronic display screen for presenting information including text transcribed from a hearing user's voice signal or message. In most cases screen 50 will present a graphical user interface with on screen tools for editing text that appears on the screen. One text editing system is described in U.S. Pat. No. 7,164,753, which issued on Jan. 16, 2007, is titled “Real Time Transcription Correction System”, and is incorporated herein in its entirety.

Keyboard 52 is a standard text entry QWERTY type keyboard and can be used to type text or to correct text presented on display screen 50. Headset 54 includes a speaker in an ear piece and a microphone in a mouth piece and is worn by a call assistant. The headset enables a call assistant to listen to the voice of a hearing user and the microphone enables the call assistant to speak voice messages into the relay system such as, for instance, revoiced messages from a hearing user to be transcribed into text. For instance, typically during a call between a hearing user on device 14 and an assisted user on device 12, the hearing user's voice messages are presented to a call assistant via headset 54 and the call assistant revoices the messages into the relay system using headset 54. Software trained to the voice of the call assistant transcribes the assistant's voice messages into text which is presented on display screen 50. The call assistant then uses keyboard 52 and/or headset 54 to make corrections to the text on display 50. The corrected text is then transmitted to the assisted user's device 12 for display on screen 18. In the alternative, the text may be transmitted prior to correction to the assisted user's device 12 for display, and corrections may be subsequently transmitted to correct the displayed text via in-line corrections where errors are replaced by corrected text.
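By way of illustration only, the send-first, correct-later alternative described above could use messages along the lines of the following sketch. The JSON message shapes, field names and word-index addressing are assumptions made for the example; the disclosure does not specify a wire format for in-line corrections.

```python
# Minimal sketch: send uncorrected text immediately, then send corrections that
# replace a previously displayed word in place on the AU device. The message
# schema and word-index addressing are illustrative assumptions only.
import json
from typing import List


def text_message(start_index: int, words: List[str]) -> str:
    """Send newly transcribed words beginning at a running word index."""
    return json.dumps({"type": "text", "start": start_index, "words": words})


def correction_message(word_index: int, corrected_word: str) -> str:
    """Replace a previously displayed word in-line on the AU device."""
    return json.dumps({"type": "correction", "index": word_index, "word": corrected_word})


def apply_message(display_words: List[str], message: str) -> None:
    """AU device side: append new words or overwrite an erroneous word in place."""
    data = json.loads(message)
    if data["type"] == "text":
        for offset, word in enumerate(data["words"]):
            idx = data["start"] + offset
            if idx < len(display_words):
                display_words[idx] = word
            else:
                display_words.append(word)
    elif data["type"] == "correction":
        display_words[data["index"]] = data["word"]
```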

Although not shown, call assistant work station 32 may also include a foot pedal or other device for controlling the speed with which voice messages are played via headset 54 so that the call assistant can slow or even stop play of the messages while the assistant either catches up on transcription or correction of text.

Referring still to FIG. 1 and also to FIG. 2, server 30 is a computer system that includes, among other components, at least a first processor 56 linked to a memory or database 58 where software run by processor 56 to facilitate various functions that are consistent with at least some aspects of the present disclosure is stored. The software stored in memory 58 includes pre-trained call assistant voice-to-text transcription software 60 for each call assistant, where the call assistant specific software is trained to the voice of an associated call assistant, thereby increasing the accuracy of transcription activities. For instance, Naturally Speaking continuous speech recognition software by Dragon, Inc. may be pre-trained to the voice of a specific call assistant and then used to transcribe voice messages voiced by the call assistant into text.

In addition to the call assistant trained software, a voice-to-text software program 62 that is not pre-trained to a CA's voice and instead trains to any voice on the fly as voice messages are received is stored in memory 58. Again, Naturally Speaking software that can train on the fly may be used for this purpose. Hereinafter, the automatic voice recognition software or system that trains to HU voices will be referred to generally as an AVR engine at times.

Moreover, software 64 that automatically performs one of several different types of triage processes to generate text from voice messages accurately, quickly and in a relatively cost effective manner is stored in memory 58. The triage programs are described in detail hereafter.

One issue with existing relay systems is that each call is relatively expensive to facilitate. To this end, in order to meet required accuracy standards for text caption calls, each call requires a dedicated call assistant. While automated voice-to-text systems that would not require a call assistant have been contemplated, none has been implemented because of accuracy and speed problems.

One aspect of the present disclosure is related to a system that is semi-automated wherein a call assistant is used when accuracy of an automated system is not at required levels and the assistant is cut out of a call automatically or manually when accuracy of the automated system meets or exceeds accuracy standards, or at the preference of an AU. For instance, in at least some cases a call assistant will be assigned to every new call linked to a relay and the call assistant will transcribe voice to text as in an existing system. Here, however, the difference will be that, during the call, the voice of a hearing user will also be processed by server 30 to automatically transcribe the hearing user's voice messages to text (e.g., into “automated text”). Server 30 compares corrected text generated by the call assistant to the automated text to identify errors in the automated text. Server 30 uses identified errors to train the automated voice-to-text software to the voice of the hearing user. During the beginning of the call the software trains to the hearing user's voice and accuracy increases over time as the software trains. At some point the accuracy increases until required accuracy standards are met. Once accuracy standards are met, server 30 is programmed to automatically cut out the call assistant and start transmitting the automated text to the assisted user's device 12.

In at least some cases, when a call assistant is cut out of a call, the system may provide a “Help” button, an “Assist” button or an “Assistance Request” type button (see 68 in FIG. 1) to an assisted user so that, if the assisted user recognizes that the automated text has too many errors for some reason, the assisted user can request a link to a call assistant to increase transcription accuracy (e.g., generate an assistance request). In some cases the help button may be a persistent mechanical button on the assisted user's device 12. In the alternative, the help button may be a virtual on screen icon (e.g., see 68 in FIG. 1) and screen 18 may be a touch sensitive screen so that contact with the virtual button can be sensed. Where the help button is virtual, the button may only be presented after the system switches from providing call assistant generated text to an assisted user's device to providing automated text to the assisted user's device, to avoid confusion (e.g., avoid a case where an assisted user is already receiving call assistant generated text but thinks, because of a help button, that even better accuracy can be achieved in some fashion). Thus, while call assistant generated text is displayed on an assisted user's device 12, no “help” button is presented, and after automated text is presented, the “help” button is presented. After the help button is selected and a call assistant is re-linked to the call, the help button is again removed from the assisted user's device display 18 to avoid confusion.

Referring now to FIGS. 2 and 3, a method or process 70 is illustrated that may be performed by server 30 to cut out a call assistant when automated text reaches an accuracy level that meets a standard threshold level. Referring also to FIG. 1, at block 72, help and auto flags are each set to a zero value. The help flag indicates that an assisted user has selected a help or assist button via the assisted user's device 12 because of a perception that too many errors are occurring in transcribed text. The auto flag indicates that automated text accuracy has exceeded a standard threshold requirement. Zero values indicate that the help button has not been selected and that the standard requirement has yet to be met; one values indicate that the button has been selected and that the standard requirement has been met.

Referring still to FIGS. 1 and 3, at block 74, during a phone call between a hearing user using device 14 and an assisted user using device 12, the hearing user's voice messages are transmitted to server 30 at relay 16. Upon receiving the hearing user's voice messages, server 30 checks the auto and help flags at blocks 76 and 84, respectively. At least initially the auto flag will be set to zero at block 76, meaning that automated text has not reached the accuracy standard requirement, and therefore control passes down to block 78 where the hearing user's voice messages are provided to a call assistant. At block 80, the call assistant listens to the hearing user's voice messages and generates text corresponding thereto by either typing the messages, revoicing the messages to voice-to-text transcription software trained to the call assistant's voice, or a combination of both. Text generated is presented on screen 50 and the call assistant makes corrections to the text using keyboard 52 and/or headset 54 at block 80. At block 82 the call assistant generated text is transmitted to assisted user device 12 to be displayed for the assisted user on screen 18.

Referring again to FIGS. 1 and 3, at block 84, at least initially the help flag will be set to zero indicating that the assisted user has not requested additional captioning assistance. In fact, at least initially the “help” button 68 may not be presented to an assisted user as call assistant generated text is initially presented. Where the help flag is zero at block 84, control passes to block 86 where the hearing user's voice messages are fed to voice-to-text software run by server 30 that has not been previously trained to any particular voice. At block 88 the software automatically converts the hearing user's voice to text, generating automated text. At block 90, server 30 compares the call assistant generated text to the automated text to identify errors in the automated text. At block 92, server 30 uses the errors to train the voice-to-text software for the hearing user's voice. In this regard, for instance, where an error is identified, server 30 modifies the software so that the next time the utterance that resulted in the error occurs, the software will generate the word or words that the call assistant generated for the utterance. Other ways of altering or training the voice-to-text software are well known in the art and any way of training the software may be used at block 92.
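By way of illustration only, the comparison at block 90 and the error harvesting at block 92 might be sketched as follows, where a word-level alignment between the call assistant generated text and the automated text yields (automated, call assistant) pairs for each disagreement. The use of Python's difflib and the sample sentences are illustrative assumptions rather than the disclosed training mechanism.

```python
# Minimal sketch of blocks 90/92: align CA generated text with automated text
# to harvest the automated engine's errors as training pairs.
import difflib
from typing import List, Tuple


def find_errors(ca_words: List[str], auto_words: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return (automated_segment, ca_segment) pairs where the transcripts differ."""
    matcher = difflib.SequenceMatcher(a=auto_words, b=ca_words, autojunk=False)
    errors = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((auto_words[a1:a2], ca_words[b1:b2]))
    return errors


ca = "please call the pharmacy before noon".split()
auto = "please fall the pharma see before noon".split()
for wrong, right in find_errors(ca, auto):
    # each pair can be used at block 92 to adapt the engine so the misrecognized
    # utterance maps to the word or words the call assistant produced
    print(wrong, "->", right)
```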

After block 92 control passes to block 94 where server 30 monitors for a selection of the “help” button 68 by the assisted user. If the help button has not been selected, control passes to block 96 where server 30 compares the accuracy of the automated text to a threshold standard accuracy requirement. For instance, the standard requirement may require that accuracy be greater than 96% measured over at least a most recent forty-five second period or a most recent 100 words uttered by a hearing user, whichever is longer. Where accuracy is below the threshold requirement, control passes back up to block 74 where the process described above continues. At block 96, once the accuracy is greater than the threshold requirement, control passes to block 98 where the auto flag is set to one indicating that the system should start using the automated text and delink the call assistant from the call to free up the assistant to handle a different call. A virtual “help” button may also be presented via the assisted user's display 18 at this time. Next, at block 100, the call assistant is delinked from the call and at block 102 the processor generated automated text is transmitted to the AU device to be presented on display screen 18.
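By way of illustration only, the block 96 test might be implemented along the lines of the following sketch, which measures accuracy over the most recent 100 words or the most recent 45 seconds, whichever window covers more speech, against a 96% threshold. The data structures and the interpretation of “whichever is longer” are assumptions for the example.

```python
# Minimal sketch of the block 96 accuracy test, assuming per-word correctness
# flags are available from the block 90 comparison.
import time
from collections import deque
from typing import Deque, Optional, Tuple

WORD_WINDOW = 100
TIME_WINDOW_S = 45.0
ACCURACY_THRESHOLD = 0.96

_history: Deque[Tuple[float, bool]] = deque()   # (timestamp, word_was_correct)


def record_word(correct: bool, now: Optional[float] = None) -> None:
    _history.append((time.time() if now is None else now, correct))


def ready_to_cut_out_ca(now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    by_time = [ok for ts, ok in _history if now - ts <= TIME_WINDOW_S]
    by_count = [ok for _, ok in list(_history)[-WORD_WINDOW:]]
    window = by_time if len(by_time) > len(by_count) else by_count   # longer window
    if not window:
        return False
    return sum(window) / len(window) > ACCURACY_THRESHOLD
```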

Referring again to block 74, the hearing user's voice is continually received during a call and, at block 76, once the auto flag has been set to one, the lower portion of the left hand loop including blocks 78, 80 and 82 is cut out of the process as control loops back up to block 74.

Referring again to block 94, if, during an automated portion of a call when automated text is being presented to the assisted user, the assisted user decides that there are too many errors in the transcription presented via display 18 and the assisted user selects the “help” button 68 (see again FIG. 1), control passes to block 104 where the help flag is set to one indicating that the assisted user has requested the assistance of a call assistant and the auto flag is reset to zero indicating that call assistant generated text will be used to drive the assisted user's display 18 instead of the automated text. Thereafter control passes back up to block 74. Again, with the auto flag set to zero the next time through decision block 76, control passes back down to block 78 where the call is again linked to a call assistant for transcription as described above. In addition, the next time through block 84, because the help flag is set to one, control passes back up to block 74 and the automated text loop including blocks 86 through 104 is effectively cut out of the rest of the call.

In at least some embodiments, there will be a short delay (e.g., 5 to 10 seconds in most cases) between setting the flags at block 104 and stopping use of the automated text so that a new call assistant can be linked up to the call and start generating call assistant generated text prior to halting the automated text. In these cases, until the call assistant is linked and generating text for at least a few seconds (e.g., 3 seconds), the automated text will still be used to drive the assisted user's display 18. The delay may either be a pre-defined delay or may have a case specific duration that is determined by server 30 monitoring call assistant generated text and switching over to the call assistant generated text once the call assistant is up to speed.

In some embodiments, prior to delinking a call assistant from a call at block 100, server 30 may store a call assistant identifier along with a call identifier for the call. Thereafter, if an assisted user requests help at block 94, server 30 may be programmed to identify whether the call assistant previously associated with the call is available (e.g., not handling another call) and, if so, may re-link to that call assistant at block 78. In this manner, if possible, a call assistant that has at least some context for the call can be linked up to restart transcription services.
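By way of illustration only, the preference for re-linking the original call assistant might look like the following sketch; the lookup tables and function names are assumptions for the example.

```python
# Minimal sketch of preferring the call assistant who previously handled the call.
from typing import Dict, List, Optional

_previous_ca_for_call: Dict[str, str] = {}   # call id -> CA id stored at delink time
_ca_busy: Dict[str, bool] = {}               # CA id -> currently handling a call?


def remember_ca(call_id: str, ca_id: str) -> None:
    _previous_ca_for_call[call_id] = ca_id


def pick_ca_for_help(call_id: str, idle_cas: List[str]) -> Optional[str]:
    """Prefer the CA previously on this call if idle; otherwise any idle CA."""
    prior = _previous_ca_for_call.get(call_id)
    if prior is not None and not _ca_busy.get(prior, False):
        return prior
    return idle_cas[0] if idle_cas else None
```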

In some embodiments it is contemplated that after an assisted user has selected a help button to receive call assistance, the call will be completed with a call assistant on the line. In other cases it is contemplated that server 30 may, when a call assistant is re-linked to a call, start a second triage process to attempt to delink the call assistant a second time if a threshold accuracy level is again achieved. For instance, in some cases, midstream during a call, a second hearing user may start communicating with the assisted user via the hearing user's device. For instance, a child may yield the hearing user's device 14 to a grandchild that has a different voice profile, causing the assisted user to request help from a call assistant because of perceived text errors. Here, after the hand back to the call assistant, server 30 may start training on the grandchild's voice and may eventually achieve the threshold level required. Once the threshold is again met, the call assistant may be delinked a second time so that automated text is again fed to the assisted user's device.

As another example, text errors in automated text may be caused by temporary noise on one or more of the lines carrying the hearing user's voice messages to relay 16. Here, once the noise clears up, automated text may again be a suitable option. Thus, here, after an assisted user requests call assistant help, the triage process may again commence and, if the threshold accuracy level is again exceeded, the call assistant may be delinked and the automated text may again be used to drive the assisted user's device 12. While the threshold accuracy level may be the same each time through the triage process, in at least some embodiments the accuracy level may be changed each time through the process. For instance, the first time through the triage process the accuracy threshold may be 96%. The second time through the triage process the accuracy threshold may be raised to 98%.

In at least some embodiments, when the automated text accuracy exceeds the standard accuracy threshold, there may be a short transition time during which a call assistant on a call observes automated text while listening to a hearing user's voice message to manually confirm that the handover from call assistant generated text to automated text is smooth. During this short transition time, for instance, the call assistant may watch the automated text on her workstation screen 50 and may correct any errors that occur during the transition. In at least some cases, if the call assistant perceives that the handoff does not work or that the quality of the automated text is poor for some reason, the call assistant may opt to retake control of the transcription process.

One sub-process 120 that may be added to the process shown in FIG. 3 for managing a call assistant to automated text handoff is illustrated in FIG. 4. Referring also to FIGS. 1 and 2, at block 96 in FIG. 3, if the accuracy of the automated text exceeds the accuracy standard threshold level, control may pass to block 122 in FIG. 4. At block 122, a short duration transition timer (e.g., 10-15 seconds) is started. At block 124 automated text (e.g., text generated by feeding the hearing user's voice messages directly to voice-to-text software) is presented on the call assistant's display 50. At block 126 an on screen “Retain Control” icon or virtual button is provided to the call assistant via the assistant's display screen 50, which can be selected by the call assistant to forego the handoff to the automated voice-to-text software. At block 128, if the “Retain Control” icon is selected, control passes to block 132 where the help flag is set to one and then control passes back up to block 76 in FIG. 3 where the call assistant process for generating text continues as described above. At block 128, if the call assistant does not select the “Retain Control” icon, control passes to block 130 where the transition timer is checked. If the transition timer has not timed out, control passes back up to block 124. Once the timer times out at block 130, control passes back to block 98 in FIG. 3 where the auto flag is set to one and the call assistant is delinked from the call.
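By way of illustration only, the FIG. 4 transition window might be sketched as follows, where the call assistant can veto the handoff during a short timer. The polling loop, the 12 second default and the function names are illustrative assumptions.

```python
# Minimal sketch of the FIG. 4 transition window. retain_pressed is a callable
# that reports whether the call assistant has selected "Retain Control".
import time


def run_transition(retain_pressed, transition_s: float = 12.0, poll_s: float = 0.25) -> str:
    """Return 'ca_retains' if the CA vetoes the handoff, else 'handoff' when the timer expires."""
    deadline = time.monotonic() + transition_s        # block 122: start transition timer
    while time.monotonic() < deadline:
        if retain_pressed():                          # block 128: CA selected "Retain Control"
            return "ca_retains"                       # block 132: keep CA generated text
        time.sleep(poll_s)
    return "handoff"                                  # block 130 timed out; blocks 98/100 follow
```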

In at least some embodiments it is contemplated that, after voice-to-text software takes over the transcription task and the call assistant is delinked from a call, server 30 itself may be programmed to sense when transcription accuracy has degraded substantially, and server 30 may cause a re-link to a call assistant to increase accuracy of the text transcription. For instance, server 30 may assign a confidence factor to each word in the automated text based on how confident the server is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, server 30 may re-link to a call assistant to increase transcription accuracy. The automated process for re-linking to a call assistant may be used instead of or in addition to the process described above whereby an assisted user selects the “help” button to re-link to a call assistant.
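By way of illustration only, the confidence-based re-link decision might be sketched as follows; the 100-word window mirrors the example above, while the 0.85 confidence floor is an assumed value since the disclosure leaves the exact threshold level unspecified.

```python
# Minimal sketch of the confidence-based re-link: average per-word ASR
# confidence over the most recent words and re-link a CA when it drops too low.
from collections import deque

RECENT_WORDS = 100
CONFIDENCE_FLOOR = 0.85            # assumed value; the disclosure does not fix one

_confidences = deque(maxlen=RECENT_WORDS)


def record_confidence(word_confidence: float) -> None:
    _confidences.append(word_confidence)


def should_relink_ca() -> bool:
    if len(_confidences) < RECENT_WORDS:
        return False               # wait for a full window before judging accuracy
    return sum(_confidences) / len(_confidences) < CONFIDENCE_FLOOR
```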

In at least some cases when an assisted user selects a “help” button to re-link to a call assistant, partial call assistance may be provided instead of full call assistant service. For instance, instead of adding a call assistant that transcribes a hearing user's voice messages and then corrects errors, a call assistant may be linked only for correction purposes. The idea here is that while software trained to a hearing user's voice may generate some errors, the number of errors after training will still be relatively small in most cases, even if objectionable to an assisted user. In at least some cases call assistants may be trained to have different skill sets, where highly skilled, and relatively more expensive to retain, call assistants are trained to re-voice hearing user voice messages and correct the resulting text, and less skilled call assistants are trained to simply make corrections to automated text. Here, initially all calls may be routed to highly skilled revoicing or “transcribing” call assistants and all re-linked calls may be routed to less skilled “corrector” call assistants.

A sub-process 134 that may be added to the process of FIG. 3 for routing re-linked calls to a corrector call assistant is shown in FIG. 5. Referring also to FIGS. 1 and 3, at decision block 94, if an assisted user selects the help button, control may pass to block 136 in FIG. 5 where the call is linked to a second, corrector, call assistant. At block 138 the automated text is presented to the second call assistant via the call assistant's display 50. At block 140 the second call assistant listens to the voice of the hearing user, observes the automated text, and makes corrections to errors perceived in the text. At block 142, server 30 transmits the corrected automated text to the assisted user's device for display via screen 18. After block 142 control passes back up to block 76 in FIG. 3.

In some cases where a call assistant generates text that drives an assisted user's display screen 18 (see again FIG. 1), for one reason or another the call assistant's transcription to text may fall behind the hearing user's voice message stream by a substantial amount. For instance, where a hearing user is speaking quickly, is using odd vocabulary, and/or has an unusual accent that is hard to understand, call assistant transcription may fall behind a voice message stream by 20 seconds, 40 seconds or more.

In many cases when captioning falls behind, an assisted user can perceive that presented text has fallen far behind broadcast voice messages from a hearing user based on memory of recently broadcast voice message content and observed text. For instance, an assisted user may recognize that currently displayed text corresponds to a portion of the broadcast voice message that occurred thirty seconds ago. In other cases some captioning delay indicator may be presented via an assisted user's device display 18. For instance, see FIG. 17 where captioning delay is indicated in two different ways on a display screen 18. First, text 212 indicates an estimated delay in seconds (e.g., 24 second delay). Second, at the end of already transcribed text 214, blanks 216 for words already voiced but yet to be transcribed may be presented to give an assisted user a sense of how delayed the captioning process has become.
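By way of illustration only, the two delay cues of FIG. 17 (the estimated delay 212 and the blanks 216) might be derived as in the following sketch. The timestamps assumed here (the audio time of the most recently broadcast HU word versus the audio time of the last captioned word) and the assumed speech rate used to size the blanks are illustrative assumptions only.

```python
# Minimal sketch of deriving the delay banner and pending-word blanks of FIG. 17.
ASSUMED_SPEECH_RATE_WPS = 2.2      # roughly 130 words per minute; assumed value


def delay_banner(last_voiced_ts: float, last_captioned_ts: float) -> str:
    lag = max(0.0, last_voiced_ts - last_captioned_ts)
    return f"(captioning delayed {lag:.0f} seconds)"        # cf. text 212


def pending_blanks(last_voiced_ts: float, last_captioned_ts: float) -> str:
    lag = max(0.0, last_voiced_ts - last_captioned_ts)
    n_words = int(lag * ASSUMED_SPEECH_RATE_WPS)
    return " ".join(["____"] * n_words)                      # cf. blanks 216
```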

When an assisted user perceives that captioning is too far behind or when the user cannot understand a recently broadcast voice message, the assisted user may want the text captioning to skip ahead to the currently broadcast voice message. For instance, if an assisted user had difficulty hearing the most recent five seconds of a hearing user's voice message and continues to have difficulty hearing but generally understood the preceding 25 seconds, the assisted user may want the captioning process to be re-synced with the current hearing user's voice message so that the assisted user's understanding of current words is accurate.

Here, however, because the assisted user could not understand the most recent 5 seconds of the broadcast voice message, a re-sync with the current voice message would leave the assisted user with at least some void in understanding the conversation (e.g., at least the most recent 5 seconds of misunderstood voice message would be lost). To deal with this issue, in at least some embodiments, it is contemplated that server 30 may run automated voice-to-text software on a hearing user's voice message simultaneously with a call assistant generating text from the voice message and, when an assisted user requests a “catch-up” or “re-sync” of the transcription process to the current voice message, server 30 may provide “fill in” automated text corresponding to the portion of the voice message between the most recent call assistant generated text and the instantaneous voice message, which may be provided to the assisted user's device for display and also, optionally, to the call assistant's display screen to maintain context for the call assistant. In this case, while the fill in automated text may have some errors, the fill in text will be better than no text for the associated period and can be referred to by the assisted user to better understand the voice messages.

In cases where the fill in text is presented on the call assistant's display screen, the call assistant may correct any errors in the fill in text. This correction, and any error correction by a call assistant for that matter, may be made prior to transmitting text to the assisted user's device or subsequent thereto. Where corrected text is transmitted to an assisted user's device subsequent to transmission of the original error prone text, the assisted user's device corrects the errors by replacing the erroneous text with the corrected text.

Because it is often the case that assisted users will request a re-sync only when they have difficulty understanding words, server 30 may only present automated fill in text to an assisted user corresponding to a pre-defined duration period (e.g., 8 seconds) that precedes the time when the re-sync request occurs. For instance, consistent with the example above where call assistant captioning falls behind by thirty seconds, an assisted user may only request re-sync at the end of the most recent five seconds, as inability to understand the voice message may only be an issue during those five seconds. By presenting the most recent eight seconds of automated text to the assisted user, the user will have the chance to read text corresponding to the misunderstood voice message without being inundated with a large segment of automated text to view. Where automated fill in text is provided to an assisted user for only a pre-defined duration period, the same text may be provided for correction to the call assistant.
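By way of illustration only, limiting fill in text to a pre-defined window preceding the re-sync request might be sketched as follows; the (start_time, text) segment representation is an assumption, and the 8 second constant simply mirrors the example above.

```python
# Minimal sketch of selecting only the automated text voiced in the pre-defined
# window just before the re-sync request.
FILL_IN_WINDOW_S = 8.0


def fill_in_text(auto_segments, resync_time: float) -> str:
    """auto_segments: iterable of (start_time, text) pairs from the automated engine."""
    recent = [text for start, text in auto_segments
              if resync_time - FILL_IN_WINDOW_S <= start <= resync_time]
    return " ".join(recent)
```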

Referring now to FIG. 7, a method 190 by which an assisted user requests a re-sync of the transcription process to current voice messages when call assistant generated text falls behind current voice messages is illustrated. Referring also to FIG. 1, at block 192 a hearing user's voice messages are received at relay 16. After block 192, control passes down to each of blocks 194 and 200 where two simultaneous sub-processes occur in parallel. At block 194, the hearing user's voice messages are stored in a rolling buffer. The rolling buffer may, for instance, have a two minute duration so that the most recent two minutes of a hearing user's voice messages are always stored. At block 196, a call assistant listens to the hearing user's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the call assistant's voice, typing, etc. At block 198 the call assistant generated text is transmitted to assisted user's device 12 to be presented on display screen 18 after which control passes back up to block 192. Text correction may occur at block 196 or after block 198.

Referring again to FIG. 7, at process block 200, the hearing user's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 202. Although not shown in FIG. 7, after block 202, server 30 may compare the automated text to the call assistant generated text to identify errors and may use those errors to train the software to the hearing user's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 7, at decision block 204, controller 30 monitors for a catch up or re-sync command received via the assisted user's device 12 (e.g., via selection of an on-screen virtual "catch up" button 220, see again FIG. 17). Where no catch up or re-sync command has been received, control passes back up to block 192 where the process described above continues to cycle. At block 204, once a re-sync command has been received, control passes to block 206 where the buffered voice messages are skipped and a current voice message is presented to the ear of the call assistant to be transcribed. At block 208 the automated text corresponding to the skipped voice message segment is filled in to the text on the call assistant's screen for context and at block 210 the fill in text is transmitted to the assisted user's device for display.

Where automated text is filled in upon the occurrence of a catch up process, the fill in text may be visually distinguished on the assisted user's screen and/or on the call assistant's screen. For instance, fill in text may be highlighted, underlined, bolded, shown in a distinct font, etc. For example, see FIG. 18 that shows fill in text 222 that is underlined to visually distinguish. See also that the captioning delay 212 has been updated. In some cases, fill in text corresponding to voice messages that occur after or within some pre-defined period prior to a re-sync request may be distinguished in yet a third way to point out the text corresponding to the portion of a voice message that the assisted user most likely found interesting (e.g., the portion that prompted selection of the re-sync button). For instance, where 24 previous seconds of text are filled in when a re-sync request is initiated, all 24 seconds of fill in text may be underlined and the 8 seconds of text prior to the re-sync request may also be highlighted in yellow. See in FIG. 18 that some of the fill in text is shown in a phantom box 226 to indicate highlighting.

In at least some cases it is contemplated that server 30 may be programmed to automatically determine when call assistant generated text substantially lags a current voice message from a hearing user and server 30 may automatically skip ahead to re-sync a call assistant with a current message while providing automated fill in text corresponding to intervening voice messages. For instance, server 30 may recognize when call assistant generated text is more than thirty seconds behind a current voice message and may skip the voice messages ahead to the current message while filling in automated text to fill the gap. In at least some cases this automated skip ahead process may only occur after at least some (e.g., 2 minutes) training to a hearing user's voice to ensure that minimal errors are generated in the fill in text.

A method 150 for automatically skipping to a current voice message in a buffer when a call assistant falls too far behind is shown in FIG. 6. Referring also to FIG. 1, at block 152, a hearing user's voice messages are received at relay 16. After block 152, control passes down to each of blocks 154 and 162 where two simultaneous sub-processes occur in parallel. At block 154, the hearing user's voice messages are stored in a rolling buffer. At block 156, a call assistant listens to the hearing user's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the call assistant's voice, typing, etc., after which control passes to block 170.

Referring still to FIG. 6, at process block 162, the hearing user's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 164. Although not shown in FIG. 6, after block 164, server 30 may compare the automated text to the call assistant generated text to identify errors and may use those errors to train the software to the hearing user's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 6, at decision block 166, controller 30 monitors how far call assistant text transcription is behind the current voice message and compares that value to a threshold value. If the delay is less than the threshold value, control passes down to block 170. If the delay exceeds the threshold value, control passes to block 168 where server 30 uses automated text from block 164 to fill in the call assistant generated text and skips the call assistant up to the current voice message. After block 168 control passes to block 170. At block 170, the text including the call assistant generated text and the fill in text is presented to the call assistant via display screen 50 and the call assistant makes any corrections to observed errors. At block 172, the text is transmitted to assisted user's device 12 and is displayed on screen 18. Again, uncorrected text may be transmitted to and displayed on device 12 and corrected text may be subsequently transmitted and used to correct errors in the prior text in line on device 12. After block 172 control passes back up to block 152 where the process described above continues to cycle. Automatically generated text to fill in when skipping forward may be visually distinguished (e.g., highlighted, underlined, etc.).
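
The threshold test at decision block 166 can be reduced to a comparison of the call assistant's transcription position against the live voice position. A minimal sketch, assuming both positions are expressed as time stamps in seconds; the names and the thirty second default are illustrative.

```python
def maybe_auto_catch_up(ca_position: float, live_position: float,
                        threshold: float = 30.0) -> tuple[bool, float]:
    """Decide whether the call assistant should be skipped ahead.

    Returns (skip, gap): gap is how far call assistant transcription lags the
    live voice message; skip is True when the gap exceeds the threshold and
    automated text should back-fill the skipped interval."""
    gap = live_position - ca_position
    return gap > threshold, gap

# e.g., the CA is transcribing audio voiced at t=100 s while live audio is at t=135 s
skip, gap = maybe_auto_catch_up(ca_position=100.0, live_position=135.0)
# skip == True, gap == 35.0 -> fill 35 s with automated text and jump the CA to live
```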

In at least some cases when automated fill in text is generated, that text may not be presented to the call assistant or the assisted user as a single block and instead may be doled out at a higher speed than the talking speed of the hearing user until the text catches up with a current time. To this end, where transcription is far behind a current point in a conversation, if automated catch up text were generated as an immediate single block, in at least some cases, the earliest text in the block could shoot off a call assistant's display screen or an assisted user's display screen so that the call assistant or the assisted user would be unable to view all of the automated catch up text. Instead of presenting the automated text as a complete block upon catch up, the automated catch up text may be presented at a rate that is faster (e.g., two to three times faster) than the hearing user's rate of speaking so that catch up is rapid without the oldest catch up text running off the call assistant's or assisted user's displays.
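
Rather than flushing the catch up text as one block, the text can be emitted word by word at a pace keyed to a multiple of the hearing user's speaking rate. A minimal sketch, assuming a nominal words-per-minute figure for the hearing user; the numbers and names are illustrative.

```python
import time

def dole_out_catch_up(words: list[str], speaking_wpm: float = 150.0,
                      speedup: float = 2.5, emit=print) -> None:
    """Present catch up words sequentially at roughly two to three times the
    hearing user's speaking rate so that the oldest words do not immediately
    scroll off the call assistant's or assisted user's display."""
    seconds_per_word = 60.0 / (speaking_wpm * speedup)
    for word in words:
        emit(word)
        time.sleep(seconds_per_word)
```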

In other cases, when an assisted user requests fill in, the system may automatically fill in text and only present the most recent 10 seconds or so of the automatic fill in text to the CA for correction so that the assisted user has corrected text corresponding to a most recent period as quickly as possible. In many cases where the CA generated text is substantially delayed, much of the fill in text would run off a typical assisted user's device display screen when presented, so making corrections to that text would make little sense as the assisted user that requests catch up text is typically most interested in text associated with the most recent HU voice signal.

Many assisted user's devices can be used as conventional telephones without captioning service or as assisted user devices where captioning is presented and voice messages are broadcast to an assisted user. The idea here is that one device can be used by hearing impaired persons and persons that have no hearing impairment and that the overall costs associated with providing captioning service can be minimized by only using captioning when necessary. In many cases even a hearing impaired person may not need captioning service all of the time. For instance, a hearing impaired person may be able to hear the voice of a person that speaks loudly fairly well but may not be able to hear the voice of another person that speaks more softly. In this case, captioning would be required when speaking to the person with the soft voice but may not be required when speaking to the person with the loud voice. As another instance, an impaired person may hear better when well rested but hear relatively more poorly when tired so captioning is required only when the person is tired. As still another instance, an impaired person may hear well when there is minimal noise on a line but may hear poorly if line noise exceeds some threshold. Again, the impaired person would only need captioning some of the time.

To minimize captioning service costs and still enable an impaired person to obtain captioning service whenever needed and even during an ongoing call, some systems start out all calls with a default setting where an assisted user's device 12 is used like a normal telephone without captioning. At any time during an ongoing call, an assisted user can select either a mechanical or virtual "Caption" icon or button (see again 68 in FIG. 1) to link the call to a relay, provide a hearing user's voice messages to the relay and commence captioning service. One problem with starting captioning only after an assisted user experiences problems hearing words is that at least some words (e.g., words that prompted the assisted user to select the caption button in the first place) typically go unrecognized and therefore the assisted user is left with a void in their understanding of a conversation.

One solution to the problem of lost meaning when words are not understood just prior to selection of a caption button is to store a rolling recordation of a hearing user's voice messages that can be transcribed subsequently when the caption button is selected to generate "fill in" text. For instance, the most recent 20 seconds of a hearing user's voice messages may be recorded and then transcribed only if the caption button is selected. The relay generates text for the recorded message either automatically via software or via revoicing or typing by a call assistant or via a combination of both. In addition, the call assistant or the automated voice recognition software starts transcribing current voice messages. The text from the recording and the real time messages is transmitted to and presented via assisted user's device 12 which should enable the assisted user to determine the meaning of the previously misunderstood words. In at least some embodiments the rolling recordation of hearing user's voice messages may be maintained by the assisted user's device 12 (see again FIG. 1) and that recordation may be sent to the relay for immediate transcription upon selection of the caption button.

Referring now to FIG. 8, a process 230 that may be performed by the system of FIG. 1 to provide captioning for voice messages that occur prior to a request for captioning service is illustrated. Referring also to FIG. 1, at block 232 a hearing user's voice messages are received during a call with an assisted user at the assisted user's device 12. At block 234 the assisted user's device 12 stores a most recent 20 seconds of the hearing user's voice messages on a rolling basis. The 20 seconds of voice messages are stored without captioning initially in at least some embodiments. At decision block 236, the assisted user's device monitors for selection of a captioning button (not shown). If the captioning button has not been selected, control passes back up to block 232 where blocks 232, 234 and 236 continue to cycle.

Once the caption button has been selected, control passes to block 238 where assisted user's device 12 establishes a communication link to relay 16. At block 240 assisted user's device 12 transmits the stored 20 seconds of the hearing user's voice messages along with current ongoing voice messages from the hearing user to relay 16. At this point a call assistant and/or software at the relay transcribes the voice-to-text, corrections are made (or not), and the text is transmitted back to device 12 to be displayed. At block 242 assisted user's device 12 receives the captioned text from the relay 16 and at block 244 the received text is displayed or presented on the assisted user's device display 18. At block 246, in at least some embodiments, text corresponding to the 20 seconds of hearing user voice messages prior to selection of the caption button may be visually distinguished (e.g., highlighted, bolded, underlined, etc.) from other text in some fashion. After block 246 control passes back up to block 232 where the process described above continues to cycle and captioning in substantially real time continues.
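
The rolling 20 second recordation at block 234 behaves like a fixed-length queue of audio frames that is drained when the caption button is selected. A minimal sketch under that assumption; the frame duration and class name are illustrative.

```python
from collections import deque

class RollingVoiceBuffer:
    """Keep only the most recent `seconds` of hearing user audio frames so they
    can be sent to the relay for fill in captioning when the caption button is
    selected."""

    def __init__(self, seconds: float = 20.0, frame_duration: float = 0.02):
        self.frames: deque[bytes] = deque(maxlen=int(seconds / frame_duration))

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)      # the oldest frame drops off automatically

    def drain(self) -> bytes:
        """Return and clear the buffered audio once captioning starts."""
        audio = b"".join(self.frames)
        self.frames.clear()
        return audio
```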

Referring to FIG. 9, a relay server process 270 whereby automated software transcribes voice messages that occur prior to selection of a caption button and a call assistant at least initially captions current voice messages is illustrated. At block 272, after an assisted user requests captioning service by selecting a caption button, server 30 receives a hearing user's voice messages including current ongoing messages as well as the most recent 20 seconds of voice messages that had been stored by assisted user's device 12 (see again FIG. 1). After block 272, control passes to each of blocks 274 and 278 where two simultaneous processes commence in parallel. At block 274 the stored 20 seconds of voice messages are provided to voice-to-text software run by server 30 to generate automated text and at block 276 the automated text is transmitted to the assisted user's device 12 for display. At block 278 the current or real time hearing user's voice messages are provided to a call assistant and at block 280 the call assistant transcribes the current voice messages to text. The call assistant generated text is transmitted to an assisted user's device at block 282 where the text is displayed along with the text transmitted at block 276. Thus, here, the assisted user receives text corresponding to misunderstood voice messages that occur just prior to the assisted user requesting captioning. One other advantage of this system is that when captioning starts, the call assistant is not starting captioning with an already existing backlog of words to transcribe and instead automated software is used to provide the prior text.

In addition to using a service provided by relay 16 to transcribe stored rolling text, other resources may be used to transcribe the stored rolling text. For instance, in at least some embodiments an assisted user's device may link via the Internet or the like to a third party provider that can receive voice messages and transcribe those messages, at least somewhat accurately, to text. In these cases it is contemplated that real time transcription where accuracy needs to meet a high accuracy standard would still be performed by a call assistant or software trained to a specific voice while less accuracy sensitive text may be generated by the third party provider, at least some of the time for free, and transmitted back to the assisted user's device for display.

In other cases, it is contemplated that the assisted user's device 12 itself may run voice-to-text software that could be used to at least somewhat accurately transcribe voice messages to text where the text generated by the assisted user's device would only be provided in cases where accuracy sensitivity is less than normal such as where rolling voice messages prior to selection of a caption icon to initiate captioning are to be transcribed.

FIG. 10 shows another method 300 for providing text for voice messages that occurred prior to a caption request, albeit where an assisted user's device generates the pre-request text as opposed to a relay. Referring also to FIG. 1, at block 310 a hearing user's voice messages are received at an assisted user's device 12. At block 312, the assisted user's device 12 runs voice-to-text software that, in at least some embodiments, trains on the fly to the voice of a linked hearing user and generates caption text.

Here, on the fly training may include assigning a confidence factor to each automatically transcribed word and only using text that has a high confidence factor to train a voice model for the hearing user. For instance, only text having a confidence factor greater than 95% may be used for automatic training purposes. Here, confidence factors may be assigned based on many different factors or algorithms, many of which are well known in the automatic voice recognition art. In this embodiment, at least initially, the caption text generated by the assisted user's device 12 is not displayed to the assisted user. At block 314, until the assisted user requests captioning, control simply routes back up to block 310. Once captioning is requested by an assisted user, control passes to block 316 where the text corresponding to the last 20 seconds generated by the assisted user's device is presented on the assisted user's device display 18. Here, while there may be some errors in the displayed text, at least some text associated with the most recent voice message can be quickly presented and give the assisted user the opportunity to attempt to understand the voice messages associated therewith. At block 318 the assisted user's device links to a relay and at block 320 the hearing user's ongoing voice messages are transmitted to the relay. At block 322, after call assistant transcription at the relay, the assisted user's device receives the transcribed text from the relay and at block 324 the text is displayed. After block 324 control passes back up to block 320 where the sub-loop including blocks 320, 322 and 324 continues to cycle.
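
The confidence-gated training described above can be expressed as a simple filter over recognizer output. A minimal sketch, assuming the recognizer reports a per-word confidence between 0 and 1; the cutoff mirrors the 95% example but is adjustable, and the names are illustrative.

```python
def training_words(recognized: list[tuple[str, float]],
                   min_confidence: float = 0.95) -> list[str]:
    """Keep only words whose recognizer confidence exceeds the cutoff so that
    low-confidence guesses never contaminate the hearing user's voice model."""
    return [word for word, confidence in recognized if confidence > min_confidence]

# e.g., [("hello", 0.99), ("thursday", 0.62)] -> only "hello" is used for training
```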

Thus, in the above example, instead of the assisted user's device storing the last 20 seconds of a hearing user's voice signal and transcribing that voice signal to text after the assisted user requests transcription, the assisted user's device constantly runs an ASR engine behind the scenes to generate automated engine text which is stored without initially being presented to the assisted user. Then, when the assisted user requests captioning or transcription, the most recently transcribed text can be presented via the assisted user's device display immediately or via rapid presentation (e.g., sequentially at a speed higher than the hearing user's speaking speed).

In at least some cases it is contemplated that voice-to-text software run outside control of the relay may be used to generate at least initial text for a hearing user's voice and that the initial text may be presented via an assisted user's device. Here, because known software still may generate more text transcription errors than allowed given standard accuracy requirements, a relay correction service may be provided. For instance, in addition to presenting text transcribed by the assisted user's device via a device display 18, the text transcribed by the assisted user's device may also be transmitted to a relay 16 for correction. In addition to transmitting the text to the relay, the hearing user's voice messages may also be transmitted to the relay so that a call assistant can compare the text automatically generated by the assisted user's device to the HU's voice messages. At the relay, the call assistant can listen to the voice of the hearing person and can observe associated text. Any errors in the text can be corrected and corrected text blocks can be transmitted back to the assisted user's device and used for in line correction on the assisted user's display screen.

One advantage to this type of system is that relatively less skilled call assistants may be retained at a lesser cost to perform the call assistant tasks. A related advantage is that the stress level on call assistants may be reduced appreciably by eliminating the need to both transcribe and correct at high speeds and therefore call assistant turnover at relays may be appreciably reduced which ultimately reduces costs associated with providing relay services.

A similar system may include an assisted user's device that links to some other third party provider transcription/caption server (e.g., in the "cloud") to obtain initial captioned text which is immediately displayed to an assisted user and which is also transmitted to the relay for call assistant correction. Here, again, the call assistant corrections may be used by the third party provider to train the software on the fly to the hearing user's voice. In this case, the assisted user's device may have three separate links, one to the hearing user, a second link to a third party provider server, and a third link to the relay. In other cases, the relay may create the link to the third party server for AVR services. Here, the relay would provide the HU's voice signal to the third party server, would receive text back from the server to transmit to the AU device and would receive corrections from the CA to transmit to each of the AU device and the server. The third party server would then use the corrections to train the voice model to the HU voice and would use the evolving model to continue AVR transcription.

Referring to FIG. 11, a method 360 whereby an assisted user's device transcribes a hearing user's voice to text and where corrections are made to the text at a relay is illustrated. At block 362 a hearing user's voice messages are received at an assisted user's device 12 (see also again FIG. 1). At block 364 the assisted user's device runs voice-to-text software to generate text from the received voice messages and at block 366 the generated text is presented to the assisted user via display 18. At block 370 the transcribed text is transmitted to the relay 16 and at block 372 the text is presented to a call assistant via the call assistant's display 50. At block 374 the call assistant corrects the text and at block 376 corrected blocks of text are transmitted to the assisted user's device 12. At block 378 the assisted user's device 12 uses the corrected blocks to correct the text errors via in line correction. At block 380, the assisted user's device uses the errors, the corrected text and the voice messages to train the captioning software to the hearing user's voice.
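
In line correction at block 378 amounts to splicing a corrected block over the erroneous block already on the display. A minimal sketch, assuming the display contents are held as a string and the relay sends both the original and the corrected block text; the function name is illustrative.

```python
def apply_inline_correction(displayed: str, original_block: str,
                            corrected_block: str) -> str:
    """Replace an erroneous text block already shown on the assisted user's
    display with the corrected block received from the relay; if the block is
    not found (already edited or scrolled away), leave the display unchanged."""
    index = displayed.find(original_block)
    if index < 0:
        return displayed
    return displayed[:index] + corrected_block + displayed[index + len(original_block):]

# apply_inline_correction("I will se you Tuesday", "se you", "see you")
# -> "I will see you Tuesday"
```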

In some cases instead of having a relay or an assisted user's device run automated voice-to-text transcription software, a hearing user's device may include a processor that runs transcription software to generate text corresponding to the hearing user's voice messages. To this end, device 14 may, instead of including a simple telephone, include a computer that can run various applications including a voice-to-text program or may link to some third party real time transcription software program (e.g., software run by a third party server in the "cloud") to obtain an initial text transcription substantially in real time. Here, as in the case where an assisted user's device runs the transcription software, the text will often have more errors than allowed by the standard accuracy requirements. Again, to correct the errors, the text and the hearing user's voice messages are transmitted to relay 16 where a call assistant listens to the voice messages, observes the text on screen 50 and makes corrections to eliminate transcription errors. The corrected blocks of text are transmitted to the assisted user's device for display. The corrected blocks may also be transmitted back to the hearing user's device for training the captioning software to the hearing user's voice. In these cases the text transcribed by the hearing user's device and the hearing user's voice messages may either be transmitted directly from the hearing user's device to the relay or may be transmitted to the assisted user's device 12 and then on to the relay. Where the hearing user's voice messages and text are transmitted directly to the relay 16, the voice messages and text may also be transmitted directly to the assisted user's device for immediate broadcast and display and the corrected text blocks may be subsequently used for in line correction.

In these cases the caption request option may be supported so that an assisted user can initiate captioning during an on-going call at any time by simply transmitting a signal to the hearing user's device instructing the hearing user's device to start the captioning process. Similarly, in these cases the help request option may be supported. Where the help option is facilitated, the automated text may be presented via the assisted user's device and, if the assisted user perceives that too many text errors are being generated, the help button may be selected to cause the hearing user's device or the assisted user's device to transmit the automated text to the relay for call assistant correction.

One advantage to having a hearing user's device manage or perform voice-to-text transcription is that the voice signal being transcribed can be a relatively high quality voice signal. To this end, a standard phone voice signal has a range of frequencies between 300 and about 3000 Hertz, which is only a fraction of the frequency range used by most voice-to-text transcription programs and therefore, in many cases, automated transcription software does only a poor job of transcribing voice signals that have passed through a telephone connection. Where transcription can occur within a digital signal portion of an overall system, the frequency range of voice messages can be optimized for automated transcription. Thus, where a hearing user's computer that is all digital receives and transcribes voice messages, the frequency range of the messages is relatively large and accuracy can be increased appreciably. Similarly, where a hearing user's computer can send digital voice messages to a third party transcription server, accuracy can be increased appreciably.

In at least some configurations it is contemplated that the link between an assisted user's device 12 and a hearing user's device 14 may be either a standard analog phone type connection or may be a digital connection depending on the capabilities of the hearing user's device that links to the assisted user's device. Thus, for instance, a first call may be analog and a second call may be digital. Because digital voice messages have a greater frequency range and therefore can be automatically transcribed more accurately than analog voice messages in many cases, it has been recognized that a system where automated voice-to-text program use is implemented on a case by case basis depending upon the type of voice message received (e.g., digital or analog) would be advantageous. For instance, in at least some embodiments, where a relay receives an analog voice message for transcription, the relay may automatically link to a call assistant for full call assistant transcription service where the call assistant transcribes and corrects text via revoicing and keyboard manipulation and, where the relay receives a high definition digital voice message for transcription, the relay may run an automated voice-to-text transcription program to generate automated text. The automated text may either be immediately corrected by a call assistant or may only be corrected by an assistant after a help feature is selected by an assisted user as described above.

Referring to FIG. 12, one process 400 for treating high definition digital messages differently than analog voice messages is illustrated. Referring also to FIG. 1, at block 402 a hearing user's voice messages are received at a relay 16. At decision block 404, relay server 30 determines if the received voice message is a high definition digital message or is an analog message. Where a high definition message has been received, control passes to block 406 where server 30 runs an automated voice-to-text program on the voice messages to generate automated text. At block 408 the automated text is transmitted to the assisted user's device 12 for display. Referring again to block 404, where the hearing user's voice messages are in analog, control passes to block 412 where a link to a call assistant is established so that the hearing user's voice messages are provided to a call assistant. At block 414 the call assistant listens to the voice messages and transcribes the messages into text. Error correction may also be performed at block 414. After block 414, control passes to block 408 where the call assistant generated text is transmitted to the assisted user's device 12. Again, in some cases, when automated text is presented to an assisted user, a help button may be presented that, when selected, causes automated text to be presented to a call assistant for correction. In other cases automated text may be automatically presented to a call assistant for correction.
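
The decision at block 404 is essentially a branch on signal type. A minimal sketch of that routing decision, with the two transcription back ends passed in as callables; all names are illustrative and not part of the disclosed apparatus.

```python
def route_call(voice_signal: bytes, is_hd_digital: bool,
               asr_transcribe, ca_transcribe) -> str:
    """Send a high definition digital voice signal to automated voice-to-text
    and an analog voice signal to a call assistant, mirroring block 404."""
    if is_hd_digital:
        return asr_transcribe(voice_signal)   # automated transcription path
    return ca_transcribe(voice_signal)        # full call assistant service path
```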

Another system is contemplated where all incoming calls to a relay are initially assigned to a call assistant for at least initial captioning where the option to switch to automated software generated text is only available when the call includes high definition audio and after accuracy standards have been exceeded. Here, all analog hearing user's voice messages would be captioned by a call assistant from start to finish and any high definition calls would cut out the call assistant when the standard is exceeded.

In at least some cases where an assisted user's device is capable of running automated voice-to-text transcription software, the assisted user's device 12 may be programmed to select either automated transcription when a high definition digital voice message is received or a relay with a call assistant when an analog voice message is received. Again, where device 12 runs an automated text program, call assistant correction may be automatic or may only start when a help button is selected.

FIG. 13 shows a process 430 whereby an assisted user's device 12 selects either automated voice-to-text software or a call assistant to transcribe based on the type (e.g., digital or analog) of voice messages received. At block 432 a hearing user's voice messages are received by an assisted user's device 12. At decision block 434, a processor in device 12 determines if the assisted user has selected a help button. Initially no help button is selected as no text has been presented so at least initially control passes to block 436. At decision block 436, the device processor determines if a hearing user's voice signal that is received is high definition digital or is analog. Where the received signal is high definition digital, control passes to block 438 where the assisted user's device processor runs automated voice-to-text software to generate automated text which is then displayed on the assisted user device display 18 at block 440. Referring still to FIG. 13, if the help button has been selected at block 434 or if the received voice messages are in analog, control passes to block 442 where a link to a call assistant at relay 16 is established and the hearing user's voice messages are transmitted to the relay. At block 444 the call assistant listens to the voice messages and generates text and at block 446 the text is transmitted to the assisted user's device 12 where the text is displayed at block 440.

It has been recognized that in many cases most calls facilitated using an assisted user's device will be with a small group of other hearing or non-hearing users. For instance, in many cases as much as 70 to 80 percent of all calls to an assisted user's device will be with one of five or fewer hearing user's devices (e.g., family, close friends, a primary care physician, etc.). For this reason it has been recognized that it would be useful to store voice-to-text models for at least routine callers that link to an assisted user's device so that the automated voice-to-text training process can either be eliminated or substantially expedited. For instance, when an assisted user initiates a captioning service, if a previously developed voice model for a hearing user can be identified quickly, that model can be used without a new training process and the switchover from a full service call assistant to automated captioning may be expedited (e.g., instead of taking a minute or more the switchover may be accomplished in 15 seconds or less, in the time required to recognize or distinguish the hearing user's voice from other voices).

FIG. 14 shows a sub-process 460 that may be substituted for a portion of the process shown in FIG. 3 wherein voice-to-text templates or models along with related voice recognition profiles for callers are stored and used to expedite the handoff to automated transcription. Prior to running sub-process 460, referring again to FIG. 1, server 30 is used to create a voice recognition database for storing hearing user device identifiers along with associated voice recognition profiles and associated voice-to-text models. A voice recognition profile is a data construct that can be used to distinguish one voice from others.

In the context of the FIG. 1 system, voice recognition profiles are useful because more than one person may use a hearing user's device to call an assisted user. For instance, in an exemplary case, an assisted user's son or daughter-in-law or one of any of three grandchildren may use device 14 to call an assisted user and therefore, to access the correct voice-to-text model, server 30 needs to distinguish which caller's voice is being received. Thus, in many cases, the voice recognition database will include several voice recognition profiles for each hearing user device identifier (e.g., each hearing user phone number). A voice-to-text model includes parameters that are used to customize voice-to-text software for transcribing the voice of an associated hearing user to text.

The voice recognition database will include at least one voice model for each voice profile to be used by server 30 to automate transcription whenever a voice associated with the specific profile is identified. Data in the voice recognition database will be generated on the fly as an assisted user uses device 12. Thus, initially the voice recognition database will include a simple construct with no device identifiers, profiles or voice models.
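
One plausible shape for the voice recognition database is a mapping from a hearing user device identifier to the voice profiles and voice-to-text models of the people who use that device. The sketch below assumes such a structure and a caller-supplied matching function; it is illustrative only and not the disclosed data layout.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceEntry:
    profile: dict   # features used to recognize this caller's voice
    model: dict     # parameters that tune the voice-to-text engine to the caller

@dataclass
class VoiceRecognitionDB:
    """Maps a hearing user device identifier (e.g., a phone number) to the
    voice entries of the people who call from that device."""
    entries: dict[str, list[VoiceEntry]] = field(default_factory=dict)

    def lookup(self, device_id: str, voice_sample, matches) -> VoiceEntry | None:
        """Return the stored entry whose profile matches the sampled voice, or
        None if the device or speaker is unknown (a general model is then used
        and a new profile and model are trained on the fly)."""
        for entry in self.entries.get(device_id, []):
            if matches(entry.profile, voice_sample):
                return entry
        return None

    def store(self, device_id: str, entry: VoiceEntry) -> None:
        self.entries.setdefault(device_id, []).append(entry)
```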

Referring still to FIGS. 1 and 14 and now also to FIG. 3, at decision block 84 in FIG. 3, if the help flag is still zero (e.g., an assisted user has not requested call assistant help to correct automated text errors) control may pass to block 464 in FIG. 14 where the hearing user's device identifier (e.g., a phone number, an IP address, a serial number of a hearing user's device, etc.) is received by server 30. At block 468 server 30 determines if the hearing user's device identifier has already been added to the voice recognition database. If the hearing user's device identifier does not appear in the database (e.g., the first time the hearing user's device is used to connect to the assisted user's device) control passes to block 482 where server 30 uses a general voice-to-text program to convert the hearing user's voice messages to text after which control passes to block 476. At block 476 the server 30 trains a voice-to-text model using transcription errors. Again, the training will include comparing call assistant generated text to automated text to identify errors and using the errors to adjust model parameters so that the next time a word associated with an error is uttered by the hearing user, the software will identify the correct word. At block 478, server 30 trains a voice profile for the hearing user's voice so that the next time the hearing user calls, a voice profile will exist for the specific hearing user that can be used to identify the hearing user. At block 480 the server 30 stores the voice profile and voice model for the hearing user along with the hearing user device identifier for future use after which control passes back up to block 94 in FIG. 3.

Referring still to FIGS. 1 and 14, at block 468, if the hearing user's device is already represented in the voice recognition database, control passes to block 470 where server 30 runs voice recognition software on the hearing user's voice messages in an attempt to identify a voice profile associated with the specific hearing user. At decision block 472, if the hearing user's voice does not match one of the previously stored voice profiles associated with the device identifier, control passes to block 482 where the process described above continues. At block 472, if the hearing user's voice matches a previously stored profile, control passes to block 474 where the voice model associated with the matching profile is used to tune the voice-to-text software to be used to generate automated text.

Referring still to FIG. 14, at blocks 476 and 478, the voice model and voice profile for the hearing user are continually trained. Continual training enables the system to constantly adjust the model for changes in a hearing user's voice that may occur over time or when the hearing user experiences some physical condition (e.g., a cold, a raspy voice) that affects the sound of their voice. At block 480, the voice profile and voice model are stored with the HU device identifier for future use.

In at least some embodiments, server 30 may adaptively change the order of voice profiles applied to a hearing user's voice during the voice recognition process. For instance, while server 30 may store five different voice profiles for five different hearing users that routinely connect to an assisted user's device, a first of the profiles may be used 80 percent of the time. In this case, when captioning is commenced, server 30 may start by using the first profile to analyze a hearing user's voice at block 472 and may cycle through the profiles from the most matched to the least matched.

To avoid server 30 having to store a different voice profile and voice model for every hearing person that communicates with an assisted user via device 12, in at least some embodiments it is contemplated that server 30 may only store models and profiles for a limited number (e.g., 5) of frequent callers. To this end, in at least some cases server 30 will track calls and automatically identify the most frequent hearing user devices used to link to the assisted user's device 12 over some rolling period (e.g., 1 month) and may only store models and profiles for the most frequent callers. Here, a separate counter may be maintained for each hearing user device used to link to the assisted user's device over the rolling period and different models and profiles may be swapped in and out of the stored set based on frequency of calls.
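
The per-device counters over a rolling period can be kept as time stamped call records that are re-counted when deciding which models and profiles to retain. A minimal sketch under that assumption; the period, the cap of five devices, and the names are illustrative.

```python
import time
from collections import defaultdict

class FrequentCallerTracker:
    """Count calls per hearing user device over a rolling period so that voice
    models and profiles are kept only for the most frequent callers."""

    def __init__(self, period_days: float = 30.0, keep: int = 5):
        self.period = period_days * 86400.0
        self.keep = keep
        self.calls: dict[str, list[float]] = defaultdict(list)

    def record_call(self, device_id: str, now: float | None = None) -> None:
        self.calls[device_id].append(time.time() if now is None else now)

    def devices_to_keep(self, now: float | None = None) -> list[str]:
        """Return the device identifiers whose models and profiles to retain."""
        now = time.time() if now is None else now
        counts = {dev: sum(1 for t in ts if now - t <= self.period)
                  for dev, ts in self.calls.items()}
        return sorted(counts, key=counts.get, reverse=True)[:self.keep]
```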

In other embodiments server 30 may query an assisted user for some indication that a specific hearing user is or will be a frequent contact and may add that person to a list for which a model and a profile should be stored, for a total of up to five persons.

While the system described above with respect to FIG. 14 assumes that the relay 16 stores and uses voice models and voice profiles that are trained to hearing user's voices for subsequent use, in at least some embodiments it is contemplated that an assisted user's device 12 processor may maintain and use, or at least have access to and use, the voice recognition database to generate automated text without linking to a relay. In this case, because the assisted user's device runs the software to generate the automated text, the software for generating text can be trained any time the user's device receives a hearing user's voice messages without linking to a relay. For example, during a call between a hearing user and an assisted user on devices 14 and 12, respectively, in FIG. 1, and prior to an assisted user requesting captioning service, the voice messages of even a new hearing user can be used by the assisted user's device to train a voice-to-text model and a voice profile for the user. In addition, prior to a caption request, as the model is trained and gets better and better, the model can be used to generate text that can be used as fill in text (e.g., text corresponding to voice messages that precede initiation of the captioning function) when captioning is selected.

FIG. 15 shows a process 500 that may be performed by an assisted user's device to train voice models and voice profiles and use those models and profiles to automate text transcription until a help button is selected. Referring also to FIG. 1, at block 502, an assisted user's device 12 processor receives a hearing user's voice messages as well as an identifier (e.g., a phone number) of the hearing user's device 14. At block 504 the processor determines if the assisted user has selected the help button (e.g., indicating that current captioning includes too many errors). If an assisted user selects the help button at block 504, control passes to block 522 where the assisted user's device is linked to a call assistant at relay 16 and the hearing user's voice is presented to the call assistant. At block 524 the assisted user's device receives text back from the relay and at block 534 the call assistant generated text is displayed on the assisted user's device display 18.

Where the help button has not been selected, control passes to block 505 where the processor uses the device identifier to determine if the hearing user's device is represented in the voice recognition database. Where the hearing user's device is not represented in the database, control passes to block 528 where the processor uses a general voice-to-text program to convert the hearing user's voice messages to text after which control passes to block 512.

Referring again to FIGS. 1 and 15, at block 512 the processor adaptively trains the voice model using perceived errors in the automated text. To this end, one way to train the voice model is to generate text phonetically and thereafter perform a context analysis of each text word by looking at other words proximate the word to identify errors. Another example of using context to identify errors is to look at several generated text words as a phrase and compare the phrase to similar prior phrases that are consistent with how the specific hearing user strings words together and identify any discrepancies as possible errors. At block 514 a voice profile for the hearing user is generated from the hearing user's voice messages so that the hearing user's voice can be recognized in the future. At block 516 the voice model and voice profile for the hearing user are stored for future use during subsequent calls and then control passes to block 518 where the process described above continues. Thus, blocks 528, 512, 514 and 516 enable the assisted user's device to train voice models and voice profiles for hearing users that call in anew where a new voice model can be used during an ongoing call and during future calls to provide generally accurate transcription.

Referring still to FIGS. 1 and 15, if the hearing user's device is already represented in the voice recognition database at block 505, control passes to block 506 where the processor runs voice recognition software on the hearing user's voice messages in an attempt to identify one of the voice profiles associated with the device identifier. At block 508, where no voice profile is recognized, control passes to block 528.

At block 508, if the hearing user's voice matches one of the stored voice profiles, control passes to block 510 where the voice-to-text model associated with the matching profile is used to generate automated text from the hearing user's voice messages. Next, at block 518, the assisted user's device processor determines if the caption button on the assisted user's device has been selected. If captioning has not been selected, control passes to block 502 where the process continues to cycle. Once captioning has been requested, control passes to block 520 where assisted user's device 12 displays the most recent 10 seconds of automated text and continuing automated text on display 18.

In at least some embodiments it is contemplated that different types of voice model training may be performed by different processors within the overall FIG. 1 system. For instance, while an assisted user's device is not linked to a relay, the assisted user's device cannot use any errors identified by a call assistant at the relay to train a voice model as no call assistant is generating errors. Nevertheless, the assisted user's device can use context and confidence factors to identify errors and train a model. Once an assisted user's device is linked to a relay where a call assistant corrects errors, the relay server can use the call assistant identified errors and corrections to train a voice model which can, once sufficiently accurate, be transmitted to the assisted user's device where the new model is substituted for the old content based model or where the two models are combined into a single robust model in some fashion. In other cases when an assisted user's device links to a relay for call assistant captioning, a context based voice model generated by the assisted user's device for the hearing user may be transmitted to the relay server and used as an initial model to be further trained using call assistant identified errors and corrections. In still other cases call assistant errors may be provided to the assisted user's device and used by that device to further train a context based voice model for the hearing user.

Referring now to FIG. 16, a sub-process 550 that may be added to the process shown in FIG. 15 whereby an assisted user's device trains a voice model for a hearing user using voice message content and a relay server further trains the voice model generated by the assisted user's device using call assistant identified errors is illustrated. Referring also to FIG. 15, sub-process 550 is intended to be performed in parallel with blocks 524 and 534 in FIG. 15. Thus, after block 522, in addition to block 524, control also passes to block 552 in FIG. 16. At block 552 the voice model for a hearing user that has been generated by an assisted user's device 12 is transmitted to relay 16 and at block 553 the voice model is used to modify a voice-to-text program at the relay. At block 554 the modified voice-to-text program is used to convert the hearing user's voice messages to automated text. At block 556 the call assistant generated text is compared to the automated text to identify errors. At block 558 the errors are used to further train the voice model. At block 560, if the voice model has an accuracy below the required standard, control passes back to block 502 in FIG. 15 where the process described above continues to cycle. At block 560, once the accuracy exceeds the standard requirement, control passes to block 562 wherein server 30 transmits the trained voice model to the assisted user's device for handling subsequent calls from the hearing user for which the model was trained. At block 564 the new model is stored in the database maintained by the assisted user's device.

Referring still to FIG. 16, in addition to transmitting the trained model to the assisted user's device at block 562, once the model is accurate enough to meet the standard requirements, server 30 may perform an automated process to cut out the call assistant and instead transmit automated text to the assisted user's device as described above in FIG. 1. In the alternative, once the model has been transmitted to the assisted user's device at block 562, the relay may be programmed to hand off control to the assisted user's device which would then use the newly trained and relatively more accurate model to perform automated transcription so that the relay could be disconnected.

Several different concepts and aspects of the present disclosure have been described above. It should be understood that many of the concepts and aspects may be combined in different ways to configure other triage systems that are more complex. For instance, one exemplary system may include an assisted user's device that attempts automated captioning with on the fly training first and, when automated captioning by the assisted user's device fails (e.g., a help icon is selected by an assisted user), the assisted user's device may link to a third party captioning system via the internet or the like where another more sophisticated voice-to-text captioning software is applied to generate automated text. Here, if the help button is selected a second time or a "call assistant" button is selected, the assisted user's device may link to a call assistant at the relay for call assistant captioning with simultaneous voice-to-text software transcription where errors in the automated text are used to train the software until a threshold accuracy requirement is met. Here, once the accuracy requirement is exceeded, the system may automatically cut out the call assistant and switch to the automated text from the relay until the help button is again selected. In each of the transcription hand offs, any learning or model training performed by one of the processors in the system may be provided to the next processor in the system to be used to expedite the training process.

In at least some embodiments an automated voice-to-text engine may be utilized in other ways to further enhance calls handled by a relay. For instance, in cases where transcription by a call assistant lags behind a hearing user's voice messages, automated transcription software may be programmed to transcribe text all the time and identify specific words in a hearing user's voice messages to be presented via an assisted user's display immediately when identified to help the assisted user determine when a hearing user is confused by a communication delay. For instance, assume that transcription by a call assistant lags a hearing user's most current voice message by 20 seconds and that an assisted user is relying on the call assistant generated text to communicate with the hearing user. In this case, because the call assistant generated text lag is substantial, the hearing user may be confused when the assisted user's response also lags a similar period and may generate a voice message questioning the status of the call. For instance, the hearing user may utter "Are you there?" or "Did you hear me?" or "Hello" or "What did you say?". These phrases and others like them querying call status are referred to herein as "line check words" (LCWs) as the hearing user is checking the status of the call on the line.

If the line check words were not presented until they occurred sequentially in the hearing user's voice messages, they would be delayed for 20 or more seconds in the above example. In at least some embodiments it is contemplated that the automated voice engine may search for line check words (e.g., 50 common line check phrases) in a hearing user's voice messages and present the line check words immediately via the assisted user's device during a call regardless of which words have been transcribed and presented to an assisted user. The assisted user, seeing line check words or a phrase, can verbally respond that the captioning service is lagging but catching up so that the parties can avoid or at least minimize confusion.

When line check words are presented to an assisted user, the words may be presented in-line within text being generated by a call assistant with intermediate blanks representing words yet to be transcribed by the call assistant. To this end, see again FIG. 17 that shows line check words "Are you still there?" in a highlighting box 590 at the end of intermediate blanks 216 representing words yet to be transcribed by the call assistant. Line check words will, in at least some embodiments, be highlighted on the display or otherwise visually distinguished. In other embodiments the line check words may be located at some prominent location on the assisted user's display screen (e.g., in a line check box or field at the top or bottom of the display screen).

One advantage of using an automated voice engine to only search for specific words and phrases is that the engine can be tuned for those words and will be relatively more accurate than a general purpose engine that transcribes all words uttered by a hearing user. In at least some embodiments the automated voice engine will be run by an assisted user's device processor while in other embodiments the automated voice engine may be run by the relay server with the line check words transmitted to the assisted user's device immediately upon generation and identification.

In still other cases where automated text is presented immediately upon generation to an assisted user, line check words may be presented in a visually distinguished fashion (e.g., highlighted, in a different color, as a distinct font, as a uniquely sized font, etc.) so that an assisted user can distinguish those words from others and, where appropriate, provide a clarifying remark to a confused hearing user.

Referring now to FIG. 19, a process 600 that may be performed by an assisted user's device 12 and a relay to transcribe hearing user's voice messages and provide line check words immediately to an assisted user when transcription by a call assistant lags is illustrated. At block 602 a hearing user's voice messages are received by an assisted user's device 12. After block 602 control continues along parallel sub-processes to blocks 604 and 612. At block 604 the assisted user's device processor uses an automated voice engine to transcribe the hearing user's voice messages to text. Here, it is assumed that the voice engine may generate several errors and therefore likely would be insufficient for the purposes of providing captioning to the assisted user. The engine, however, is optimized and trained to caption a set (e.g., 10 to 100) of line check words and/or phrases which the engine can do extremely accurately. At block 606, the assisted user's device processor searches for line check words in the automated text. At block 608, if a line check word or phrase is not identified, control passes back up to block 602 where the process continues to cycle. At block 608, if a line check word or phrase is identified, control passes to block 610 where the line check word/phrase is immediately presented (see phrase "Are you still there?" in FIG. 18) to the assisted user via display 18 either in-line or in a special location and, in at least some cases, in a visually distinct manner.

Referring still to FIG. 19, at block 612 the hearing user's voice messages are sent to a relay for transcription. At block 614, transcribed text is received at the assisted user's device back from the relay. At block 616 the text from the relay is used to fill in the intermediate blanks (see again FIG. 17 and also FIG. 18 where text has been filled in) on the assisted user's display.
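
Because the engine at block 604 only needs to spot a small, fixed set of phrases, the search step at block 606 can be as simple as a substring scan over each newly transcribed segment. A minimal sketch, assuming a short illustrative phrase list (the disclosure contemplates on the order of 50 phrases); the names are not part of the disclosed apparatus.

```python
LINE_CHECK_PHRASES = [
    "are you there", "are you still there", "did you hear me",
    "hello", "what did you say",
]  # illustrative subset of a larger tuned phrase list

def find_line_check_phrase(asr_text: str) -> str | None:
    """Scan the automated transcription of the newest voice segment for a line
    check phrase so the assisted user's device can show it immediately, ahead
    of the lagging call assistant text."""
    lowered = asr_text.lower()
    for phrase in LINE_CHECK_PHRASES:
        if phrase in lowered:
            return phrase
    return None
```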

In at least some embodiments it is contemplated that an automated voice-to-text engine may operate all the time and may check for and indicate any potential errors in call assistant generated text so that the call assistant can determine if the errors should be corrected. For instance, in at least some cases, the automated voice engine may highlight potential errors in call assistant generated text on the call assistant's display screen, inviting the call assistant to correct the potential errors. In these cases the call assistant would have the final say regarding whether or not a potential error should be altered.

Consistent with the above comments, see FIG. 20 that shows a screen shot of a call assistant's display screen where potential errors have been highlighted to distinguish the errors from other text. Exemplary call assistant generated text is shown at 650 with errors shown in phantom boxes 652, 654 and 656 that represent highlighting. In the illustrated example, exemplary words generated by an automated voice-to-text engine are also presented to the call assistant in hovering fields above the potentially erroneous text as shown at 658, 660 and 662. Here, a call assistant can simply touch a suggested correction in a hovering field to make a correction and replace the erroneous word with the automated text suggested in the hovering field. If a call assistant instead touches an error, the call assistant can manually change the word to another word. If a call assistant does not touch an error or an associated corrected word, the word remains as originally transcribed by the call assistant. An "Accept All" icon is presented at 669 that can be selected to accept all of the suggestions presented on a call assistant's display. All corrected words are transmitted to an assisted user's device to be displayed.

Referring to FIG. 21, a method 700 by which a voice engine generates text to be compared to call assistant generated text and by which a correction interface as in FIG. 20 is provided for the call assistant is illustrated. At block 702 the hearing user's voice messages are provided to a relay. After block 702 control follows two parallel paths to blocks 704 and 716. At block 704 the hearing user's voice messages are transcribed into text by an automated voice-to-text engine run by the relay server before control passes to block 706. At block 716 a call assistant transcribes the hearing user's voice messages to call assistant generated text. At block 718 the call assistant generated text is transmitted to the assisted user's device to be displayed. At block 720 the call assistant generated text is displayed on the call assistant's display screen 50 for correction, after which control passes to block 706.

Referring still to FIG. 21, at block 706 the relay server compares the call assistant generated text to the automated text to identify any discrepancies. Where the automated text matches the call assistant generated text at block 708, control passes back up to block 702 where the process continues. Where the automated text does not match the call assistant generated text at block 708, control passes to block 710 where the server visually distinguishes the mismatched text on the call assistant's display screen 50 and also presents suggested correct text (e.g., the automated text). Next, at block 712 the server monitors for any error corrections by the call assistant and at block 714, if an error has been corrected, the corrected text is transmitted to the assisted user's device for in-line correction.
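
A minimal sketch of the block 706/708/710 comparison follows, assuming both texts are available as word lists. The use of Python's difflib is purely illustrative; the disclosure does not specify a particular matching algorithm.

    import difflib

    def find_discrepancies(ca_words: list[str], avr_words: list[str]):
        """Yield (ca_word_index_range, suggested_avr_words) for mismatched spans."""
        matcher = difflib.SequenceMatcher(a=ca_words, b=avr_words)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                yield (i1, i2), avr_words[j1:j2]

    ca_text = "I will pick up the pizza at pals".split()
    avr_text = "I will pick up the pizza at Pete's".split()
    for (start, end), suggestion in find_discrepancies(ca_text, avr_text):
        # Block 710: highlight ca_text[start:end] on display 50 and present
        # the suggestion (e.g., in a hovering field) for one-touch correction.
        print(ca_text[start:end], "->", suggestion)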

In at least some embodiments the relay server may be able to generate some type of probability or confidence factor related to how likely a discrepancy between automated and call assistant generated text is to be related to a call assistant error and may only indicate errors and present suggestions for probable errors or discrepancies likely to be related to errors. For instance, where an automated text segment is different than an associated call assistant generated text segment but the automated segment makes no sense contextually in a sentence, the server may not indicate the discrepancy or may not show the automated text segment as an option for correction. The same discrepancy may be shown as a potential error at a different time if the automated segment makes contextual sense.

In still other embodiments, automated voice-to-text software that operates at the same time as a call assistant to generate text may be trained to recognize words often missed by a call assistant, such as articles, for instance, and to ignore other words that call assistants more accurately transcribe.

The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.

Thus, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the following appended claims. For example, while the methods above are described as being performed by specific system processors, in at least some cases various method steps may be performed by other system processors. For instance, where a hearing user's voice is recognized and then a voice model for the recognized hearing user is employed for voice-to-text transcription, the voice recognition process may be performed by an assisted user's device and the identified voice may be indicated to a relay 16 which then identifies a related voice model to be used. As another instance, a hearing user's device may identify a hearing user's voice and indicate the identity of the hearing user to the assisted user's device and/or the relay.

As another example, while the system is described above in the context of a two line captioning system where one line links an assisted user's device to a hearing user's device and a second line links the assisted user's device to a relay, the concepts and features described above may be used in any transcription system, including a system where the hearing user's voice is transmitted directly to a relay and the relay then transmits transcribed text and the hearing user's voice to the assisted user's device.

As still one other example, while inputs to an assisted user's device may include mechanical or virtual on-screen buttons/icons, in some embodiments other input arrangements may be supported. For instance, in some cases help or a captioning request may be indicated via a voice input (e.g., a verbal request for assistance or for captioning).

As another example, in at least some cases where a relay includes first and second differently trained call assistants, where first call assistants are trained to be capable of transcribing and correcting text and second call assistants are only trained to be capable of correcting text, a call assistant may always be on a call but the automated voice-to-text software may aid in the transcription process whenever possible to minimize overall costs. For instance, when a call is initially linked to a relay so that a hearing user's voice is received at the relay, the hearing user's voice may be provided to a first call assistant fully trained to transcribe and correct text. Here, voice-to-text software may train to the hearing user's voice while the first call assistant transcribes the text and, after the voice-to-text software accuracy exceeds a threshold, instead of completely cutting out the relay or call assistant, the automated text may be provided to a second call assistant that is only trained to correct errors. Here, after training, the automated text should have minimal errors and therefore even a minimally trained call assistant should be able to make corrections to the errors in a timely fashion. In other cases, a first CA assigned to a call may only correct errors in automated voice-to-text transcription and a fully trained revoicing and correcting CA may only be assigned after a help or caption request is received.

In other systems an assisted user's device processor may run automated voice-to-text software to transcribe hearing user's voice messages and may also generate a confidence factor for each word in the automated text based on how confident the processor is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, the device processor may link to a relay for more accurate transcription, either via more sophisticated automated voice-to-text software or via a call assistant. The automated process for linking to a relay may be used instead of or in addition to the process described above whereby an assisted user selects a "caption" button to link to a relay.
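
The following is a hedged sketch of the rolling confidence check described above: per-word confidence factors from the device AVR engine are averaged over the most recent N words, and a relay link is requested when the average drops below a threshold. The window size, the threshold value, and the class and method names are assumptions for illustration only.

    from collections import deque

    class ConfidenceMonitor:
        def __init__(self, window_words: int = 100, threshold: float = 0.85):
            self.scores = deque(maxlen=window_words)
            self.threshold = threshold

        def add_word(self, confidence: float) -> bool:
            """Record one word's confidence; return True if relay help is needed."""
            self.scores.append(confidence)
            average = sum(self.scores) / len(self.scores)
            return average < self.threshold

    monitor = ConfidenceMonitor()
    for word, conf in [("pizza", 0.97), ("pals", 0.41), ("tonight", 0.52)]:
        if monitor.add_word(conf):
            # Link to the relay for CA or more sophisticated AVR transcription.
            print("confidence below threshold after", word)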

In addition to storing hearing user voice models, a system may also store other information that could be used when an assisted user is communicating with specific hearing users to increase accuracy of automated voice-to-text software when used. For instance, a specific hearing user may routinely use complex words from a specific industry when conversing with an assisted user. The system software can recognize when a complex word is corrected by a call assistant or contextually by automated software and can store the word and the pronunciation of the word by the specific hearing user in a hearing user word list for subsequent use. Then, when the specific hearing user subsequently links to the assisted user's device to communicate with the assisted user, the stored word list for the hearing user may be accessed and used to automate transcription. The hearing user's word list may be stored at a relay, by an assisted user's device or even by a hearing user's device where the hearing user's device has data storing capability.

In other cases a word list specific to an assisted user's device (i.e., to an assisted user) that includes complex or common words routinely used to communicate with the assisted user may be generated, stored and updated by the system. This list may include words used on a regular basis by any hearing user that communicates with the assisted user. In at least some cases this list or the hearing users' word lists may be stored on an internet accessible database (e.g., in the "cloud") so that the assisted user has the ability to access the list(s) and edit words on the list via an internet portal or some other network interface.
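
As a non-limiting illustration of the word-list feature above, the sketch below keeps a per-hearing-user word list keyed by the HU's phone number so that an AVR engine can consult it before falling back to its general vocabulary. The storage location, file format and schema are assumptions made for this example.

    import json
    from pathlib import Path

    WORD_LIST_PATH = Path("hu_word_lists.json")  # illustrative storage location

    def load_word_list(hu_number: str) -> dict[str, str]:
        """Return {spoken_form: written_form} learned for this hearing user."""
        if WORD_LIST_PATH.exists():
            return json.loads(WORD_LIST_PATH.read_text()).get(hu_number, {})
        return {}

    def record_correction(hu_number: str, spoken_form: str, corrected_word: str) -> None:
        """Store a CA or contextual correction for reuse on later calls."""
        data = json.loads(WORD_LIST_PATH.read_text()) if WORD_LIST_PATH.exists() else {}
        data.setdefault(hu_number, {})[spoken_form] = corrected_word
        WORD_LIST_PATH.write_text(json.dumps(data, indent=2))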

Where an HU's complex or hard to spell word list and/or an AU's word list is available, when a CA is creating CA generated text (e.g., via revoicing, typing, etc.), an AVR engine may always operate to search the HU voice signal to recognize when a complex or difficult to spell word is annunciated, and the complex or hard to spell words may be automatically presented to the CA via the CA display screen in line with the CA generated text to be considered by the CA. Here, while the CA would still be able to change the automatically generated complex word, it is expected that CA correction of those words would not occur often given the specialized word lists for the specific communicating parties.

In still other embodiments various aspects of a hearing user's voice messages may be used to select different voice-to-text software programs that are optimized for voices having different characteristic sets. For instance, there may be different voice-to-text programs optimized for male and female voices or for voices having different dialects. Here, system software may be able to distinguish one dialect from others and select an optimized voice engine/software program to increase transcription accuracy. Similarly, a system may be able to distinguish a high pitched voice from a low pitched voice and select a voice engine accordingly.

In some cases a voice engine may be selected for transcribing a hearing user's voice based on the region of a country in which the hearing user's device resides. For instance, where a hearing user's device is located in the southern part of the United States, an engine optimized for a southern dialect may be used, while a device in New England may cause the system to select an engine optimized for another dialect. Different word lists may also be used based on the region of a country in which a hearing user's device resides.
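
A simple sketch of engine selection keyed to device region and voice pitch, consistent with the two preceding paragraphs, is shown below. The engine names, region keys and pitch threshold are hypothetical placeholders and are not part of the disclosure.

    ENGINES_BY_REGION = {
        "US-South": "avr_southern_dialect",
        "US-NewEngland": "avr_new_england_dialect",
    }

    def select_engine(region: str, pitch_hz: float) -> str:
        """Pick a voice engine by device region first, then by voice pitch."""
        if region in ENGINES_BY_REGION:
            return ENGINES_BY_REGION[region]
        # Roughly split low vs. high pitched voices; the threshold is illustrative.
        return "avr_low_pitch" if pitch_hz < 165.0 else "avr_high_pitch"

    print(select_engine("US-South", pitch_hz=120.0))   # avr_southern_dialect
    print(select_engine("US-Other", pitch_hz=210.0))   # avr_high_pitch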

In at least some cases it is contemplated that an assisted user's device will provide a text or other indication to an assisted user to convey how text that appears on an AU device display 18 is being generated. For instance, when automated voice-to-text software (e.g., an automated voice recognition (AVR) system) is generating text, the phrase "Software Generated Text" may be persistently presented (see 729 in FIG. 22) at the top of a display 18 and when CA generated text is presented, the phrase "Call Assistant Generated Text" (not illustrated) may be presented. A phrase "Call Assistant Corrected Text" (not illustrated) may be presented when automated text is corrected by a CA.

In some cases a set of virtual buttons (e.g., 68 in FIG. 1) or mechanical buttons may be provided via an AU device allowing an AU to select captioning preferences. For instance, captioning options may include "Automated/Software Generated Text", "CA Generated Text" (see virtual selection button 719 in FIG. 22) and "CA Corrected Text" (see virtual selection button 721 in FIG. 22). This feature allows an AU to preemptively select a preference in specific cases or to select a preference dynamically during an ongoing call. For example, where an AU knows from past experience that calls with a specific HU result in excessive automated text errors, the AU could select "CA Generated Text" to cause CA support to persist for the duration of a call with the specific HU.

In at least some embodiments, automated voice-to-text accuracy may be tracked by a system and indicated to any one or a subset of a CA, an AU, and an HU, either during CA text generation or during automated text presentation. Here, the accuracy value may be over the duration of an ongoing call, over a short, most recent rolling period or number of words (e.g., last 30 seconds, last 100 words, etc.), or for a most recent HU turn at talking. In some cases two averages, one over a full call period and the other over a most recent period, may be indicated. The accuracy values would be provided via the AU device display 18 (see 728 in FIG. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in FIG. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to annunciate words more accurately or slowly to improve the value.

Human communication has many different components and the meanings ascribed to text words are only one aspect of that communication. One other aspect of human non-text communication includes how words are annunciated, which often belies a speaker's emotions or other meaning. For instance, a simple change in volume while words are being spoken is often intended to convey a different level of importance. Similarly, the duration over which a word is expressed, the tone or pitch used when a phrase is annunciated, etc., can convey a different meaning. For instance, annunciating the word "Yes" quickly can connote a different meaning than annunciating the word "Yes" very slowly or such that the "s" sound carries on for a period of a few seconds. A simple text word representation is devoid of a lot of the meaning in an originally spoken phrase in many cases.

In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22, where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed, and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide distinct visual cues along with the transcribed text.

The visual cues may be automatically provided with, or used to distinguish, text presented via an AU device display regardless of the source of the text. For example, in some cases automated text may be supplemented with visual cues to indicate other communication characteristics and in at least some cases even CA generated text may be supplemented with automatically generated visual cues indicating how an HU annunciates various words and phrases. Here, as voice characteristics are detected for an HU's utterances, software tracks the voice characteristics in time and associates those characteristics with specific text words or phrases generated by the CA. Then, the visual cues for each voice characteristic are used to visually distinguish the associated words when presented to the AU.

In at least some cases an AU may be able to adjust the degree to which text is enhanced via visual cues or even to select preferred visual cues for different voice characteristics. For instance, a specific AU may find fully enabled visual cueing to be distracting and instead may only want bold capital letter visual cueing when an HU's volume level exceeds some threshold value. AU device preferences may be set via a display 18 during some type of device commissioning process.

In some embodiments it is contemplated that the automated software that identifies voice characteristics will adjust or train to an HU's voice during the first few seconds of a call and will continue to train to that voice so that voice characteristic identification is normalized to the HU's specific voice signal to avoid excessive visual cueing. Here, it has been recognized that some people's voices will have persistent voice characteristics that would normally be detected as anomalies if compared to a voice standard (e.g., a typical male or female voice). For instance, a first HU may always speak loudly and therefore, if his voice signal were compared to an average HU volume level, the voice signal would exceed the average level most if not all of the time. Here, to avoid always distinguishing the first HU's voice signal with visual cueing indicating a loud voice, the software would use the HU voice signal to determine that the first HU's voice signal is persistently loud and would normalize to the loud signal so that words uttered within a range of volumes near the persistent loud volume would not be distinguished as loud. Here, if the first HU's voice signal exceeds the range about his persistent volume level, the exceptionally loud signal may be recognized as a clear deviation from the persistent volume level for the normalized voice and therefore distinguished with a visual cue for the AU when associated text is presented. The voice characteristic recognizing software would automatically train to the persistent voice characteristics for each HU including, for instance, pitch, tone, speed of annunciation, etc., so that persistent voice characteristics of specific HU voice signals are not visually distinguished as anomalies.
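
A hedged sketch of normalizing to a hearing user's persistent volume, along the lines just described, follows: a running per-speaker baseline is maintained for word-level volume, and a word is flagged for visual cueing only when it clearly deviates from that speaker's own baseline. The dB offset, warm-up count and class name are assumptions for illustration.

    class VolumeNormalizer:
        def __init__(self, offset_db: float = 6.0, warmup_words: int = 10):
            self.offset_db = offset_db
            self.warmup_words = warmup_words
            self.count = 0
            self.mean_db = 0.0

        def is_loud(self, word_volume_db: float) -> bool:
            """Update the per-speaker baseline and flag a clear volume anomaly."""
            self.count += 1
            # Incremental mean so the baseline keeps training during the call.
            self.mean_db += (word_volume_db - self.mean_db) / self.count
            if self.count < self.warmup_words:
                # Let the baseline settle during the first few words of a call.
                return False
            return word_volume_db > self.mean_db + self.offset_db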

In at least some cases, as in the case of voice models developed and stored for specific HUs, it is contemplated that HU voice models may also be automatically developed and stored for specific HUs for specifying voice characteristics. For instance, in the above example where a first HU has a particularly loud persistent voice, the volume range about the first HU's persistent volume as well as other persistent characteristics may be determined once during an initial call with an AU and then stored along with a phone number or other HU identifying information in a system database. Here, the next time the first HU communicates with an AU via the system, the HU voice characteristic model would be automatically accessed and used to detect voice characteristic anomalies and to visually distinguish text accordingly.

Referring again to FIG. 22, in addition to changing the appearance of transcribed text to indicate annunciation qualities or characteristics, other visual cues may be presented. For instance, if an HU persistently talks at a volume that is much higher than is typical for the HU, a volume indicator 717 may be presented or visually altered in some fashion to indicate the persistent volume. As another example, a volume indicator 715 may be presented above or otherwise spatially proximate any word annunciated with an unusually high volume. In some cases the distinguishing visual cue for a specially annunciated word may only persist for a short duration (e.g., 3 seconds, until the end of a related sentence or phrase, for the next 5 words of an utterance, etc.) and then be eliminated. Here, the idea is that the visual cueing is supposed to mimic the effect of an annunciated word or phrase, which does not persist long term (e.g., the loud effect of a high volume word only persists as the word is being annunciated).

The software used to generate the HU voice characteristic models and/or to detect voice anomalies to be visually distinguished may be run via any of an HU device processor, an AU device processor, a relay processor and a third party operated processor linkable via the internet or some other network. In at least some cases it will be optimal for an HU device to develop the HU model for an HU that is associated with the device and to store the model and apply the model to the HU's voice to detect anomalies to be visually distinguished, for several reasons. In this regard, a particularly rich acoustic HU voice signal is available at the HU device so that anomalies can be better identified in many cases by the HU device as opposed to some processor downstream in the captioning process.

Referring again to FIG. 24, in at least some embodiments where an HU device 800 includes a display screen 801, an HU voice text transcription 804 may also be presented via the HU device. Here, an HU viewing the transcribed text could formulate an independent impression of transcription accuracy and whether or not a more robust transcription process (e.g., CA generation of text) is required or would be preferred. In at least some cases a virtual "CA request" button 806 or the like may be provided on the HU screen for selection so that the HU has the ability to initiate CA text transcription and/or CA correction of text. Here, an HU device may also allow an HU to switch back to automated text if an accuracy value 802 exceeds some threshold level. Where HU voice characteristics are detected, those characteristics may be used to visually distinguish text at 804 in at least some embodiments.

Where an HU device is a smart phone, a tablet computing device or some other similar device capable of downloading software applications from an application store, it is contemplated that a captioning application may be obtained from an application store for communication with one or more AU devices 12. For instance, the son or daughter of an AU may download the captioning application to be used any time the device user communicates with the AU. Here, the captioning application may have any of the functionality described in this disclosure and may result in a much better overall system in various ways.

For instance, a captioning application on an HU device may run automated voice-to-text software on a digital HU voice signal as described above, where that text is provided to the AU device 12 for display and, at times, to a relay for correction, voice model training, voice characteristic model training, etc. As another instance, an HU device may train a voice model for an HU any time an HU's voice signal is obtained, regardless of whether or not the HU is participating in a call with an AU. For example, if a dictation application on an HU device which is completely separate from a captioning application is used to dictate a letter, the HU voice signal during dictation may be used to train a general HU voice model for the HU and, more specifically, a general model that can be used subsequently by the captioning system or application. Similarly, an HU voice signal captured during entry of a search phrase into a browser or an address into mapping software which is independent of the captioning application may be used to further train the general voice model for the HU. Here, the general voice model may be extremely accurate even before it is used by an AU captioning application. In addition, an accuracy value for an HU's voice model may be calculated prior to an initial AU communication so that, if the accuracy value exceeds a high or required accuracy standard, automated text transcription may be used for an HU-AU call without requiring CA assistance, at least initially.

For instance, prior to an initial AU call, an HU device processor training to an HU voice signal may assign confidence factors to text words automatically transcribed by an AVR engine from HU voice signals. As the software trains to the HU voice, the confidence factor values would continue to increase and eventually should exceed some threshold level at which initial captioning during an AU communication would meet accuracy requirements set by the captioning industry.

As another instance, an HU voice model stored by or accessible by the HU device can be used to automatically transcribe text for any AU device without requiring continual redevelopment of the HU voice model. Thus, one HU device may be used to communicate with two separate hearing impaired persons using two different AU devices without each sub-system redeveloping the HU voice model.

As yet another instance, an HU's smart phone or tablet device running a captioning application may link directly to each of a relay and an AU's device to provide one or more of the HU voice signal, automated text and/or an HU voice model or voice characteristic model to each. This may be accomplished through two separate phone lines or via two channels on a single cellular line or via any other combination of two communication links.

In some cases an HU voice model may be generated by a relay or an AU's device or some other entity (e.g., a third party AVR engine provider) over time and the HU voice model may then be stored on the HU device or rendered accessible via that device for subsequent transcription. In this case, one robust HU voice model may be developed for an HU by any system processor or server independent of the HU device and may then be used with any AU device and relay for captioning purposes.

In still other cases, at least one system processor may monitor and assess line and/or audio conditions associated with a call and may present some type of indication to each or a subset of an AU, an HU and a CA to help each or at least one of the parties involved in a call to assess communication quality. For instance, an HU device may be able to indicate to an AU and a CA if the HU device is being used as a speaker phone, which could help explain an excessive error rate and help with a decision related to CA captioning involvement. As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some function. For example, one function may be to provide a signal to the HU indicating that the noise level is high. Another function may be to provide a noise level signal to the CA or the AU which could be indicated on one or both of the displays 50 and 18. Yet another function would be to offer one or more captioning options to any of the HU or AU or even to a text correcting CA when the noise level exceeds the threshold level. Here, the idea is that as the noise level increases, the likelihood of accurate AVR captioning will typically decrease and therefore more accurate and robust captioning options should be available.

As another instance, an HU device may transmit a known signal to an AU device which returns the known signal to the HU device, and the HU device may compare the received signal to the known signal to determine line or communication link quality. Here, the HU device may present a line quality value as shown at 808 in FIG. 24 for the HU to consider. Similarly, an AU device may present a line quality signal (not illustrated) to the AU to be considered.

In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more "coaching" indications to any one of or a subset of the HU, CA and AU for consideration. Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy. Here, in at least some cases only the most impactful coaching indications may be presented and different entities may receive different coaching indications. For instance, where noise at the HU location exceeds a threshold level, a noise indicating signal may only be presented to the HU. Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can affect or that may be the cause of or may lead to poor captioning results for the AU.

In some cases coaching may include generating haptic feedback or an audible signal or both, together with a text message, for an HU and/or an AU. To this end, while AUs routinely look at their devices to see captions during a caption assisted call, many HUs do not look at their devices during a call and simply rely on audio during communication. In the case of an AU, in some cases even when captioning is presented to an AU, the AU may look away from their device display at times when their hearing is sufficient. By providing haptic or audible or both additional signals, a user's attention can be drawn to their device display where a warning or call state text message may present more information such as, for instance, an instruction to "Speak louder" or "Move to a less noisy space", for consideration.

In some embodiments an AU may be able to set a maximum text lag time such that automated text generated by an AVR engine is used to drive an AU device screen 18 when a CA generated text lag reaches the maximum value. For instance, an AU may not want text to lag behind a broadcast HU voice signal by more than 7 seconds and may be willing to accept a greater error rate to stay within the maximum lag time period. Here, CA captioning/correction may proceed until the maximum lag time occurs, at which point automated text may be used to fill in the lag period up to a current HU voice signal on the AU device and the CA may be skipped ahead to the current HU signal automatically to continue the captioning process. Again, here, any automated fill-in text or text not corrected by a CA may be visually distinguished on the AU device display as well as on the CA display for consideration.

It has been recognized that many AUs using text to understand a broadcast HU voice signal prefer that the text lag behind the voice signal by at least some short amount of time. For instance, an AU talking to an HU may stare off into space while listening to the HU voice signal and, only when a word or phrase is not understood, may look to text on display 18 for clarification. Here, if text were to appear on a display 18 immediately upon audio broadcast to an AU, the text may be several words beyond the misunderstood word by the time the AU looks at the display, so that the AU would be required to hunt for the word. For this reason, in at least some embodiments, a short minimum text delay may be implemented prior to presenting text on display 18. Thus, all text would be delayed at least 2 seconds in some cases and perhaps longer where a text generation lag time exceeds the minimum lag value. As with other operating parameters, in at least some cases an AU may be able to adjust the minimum voice-to-text lag time to meet a personal preference.

It has been recognized that in cases where transcription switches automatically from a CA to an AVR engine when text lag exceeds some maximum lag time, it will be useful to dynamically change the threshold period as a function of how a communication between an HU and an AU is progressing. For instance, periods of silence in an HU voice signal may be used to automatically adjust the maximum lag period. For example, in some cases, if silence is detected in an HU voice signal for more than three seconds, the threshold period to change from CA text to automatic text generation may be shortened to reflect the fact that, when the HU starts speaking again, the CA should be closer to a caught up state. Then, as the HU speaks continuously for a period, the threshold period may again be extended. The threshold period prior to automatic transition to the AVR engine to reduce or eliminate text lag may be dynamically changed based on other operating parameters. For instance, the rate of error correction by a CA, the average confidence factor in AVR text, line quality, noise accompanying the HU voice signal, or any combination of these and other factors may be used to change the threshold period.
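
An illustrative sketch of such a dynamic lag threshold follows. The specific second values and step sizes are assumptions; the disclosure only requires that the threshold shrink after HU silence and grow with continuous HU speech.

    class LagThreshold:
        def __init__(self, base_seconds: float = 7.0,
                     min_seconds: float = 3.0, max_seconds: float = 12.0):
            self.value = base_seconds
            self.min = min_seconds
            self.max = max_seconds

        def on_silence(self, silence_seconds: float) -> None:
            if silence_seconds > 3.0:
                # The CA can catch up during silence, so tolerate less future lag.
                self.value = max(self.min, self.value - 1.0)

        def on_continuous_speech(self, speech_seconds: float) -> None:
            if speech_seconds > 10.0:
                self.value = min(self.max, self.value + 1.0)

        def should_switch_to_avr(self, current_lag_seconds: float) -> bool:
            return current_lag_seconds > self.value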

One aspect described above relates to an AVR engine recognizing specific or important phrases like questions (e.g., see the phrase "Are you still there?" in FIG. 18) prior to CA text generation and presenting those phrases immediately to an AU upon detection. Other important phrases may include phrases, words or sound anomalies that typically signify "turn markers" (e.g., words or sounds often associated with a change in speaker from AU to HU or vice versa). For instance, if an HU utters the phrase "What do you think?" followed by silence, the combination including the silent period may be recognized as a turn marker and the phrase may be presented immediately with space markers (e.g., underlined spaces) between the CA text and the phrase, to be filled in by the CA text transcription once the CA catches up to the turn marker phrase.

To this end, see the text at 731 in FIG. 22 where CA generated text is shown at 733 with a lag time indicated by underlined spaces at 735 and an AVR recognized turn marker phrase presented at 737. In this type of system, in some cases the AVR engine will be programmed with a small set (e.g., 100-300) of common turn marker phrases that are specifically sought in an HU voice signal and that are immediately presented to the AU when detected. In some cases, non-text voice characteristics, like the change in sound that occurs at the end of a question which is often the signal for a turn marker, may be sought in an HU voice signal and any AVR generated text within some prior period (e.g., 5 seconds, the previous 8 words, etc.) may be automatically presented to an AU.
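
A rough, non-limiting sketch of turn marker detection consistent with the above follows: a small set of common turn marker phrases is sought in the AVR text, and a trailing silence strengthens the decision. The phrase list, silence threshold and function name are illustrative assumptions.

    TURN_MARKER_PHRASES = {
        "what do you think",
        "are you still there",
        "can you hear me",
    }

    def ends_with_turn_marker(avr_text: str, trailing_silence_s: float) -> bool:
        """True when recent AVR text ends in a turn marker phrase plus a pause."""
        cleaned = avr_text.lower().strip(" ?.!")
        phrase_hit = any(cleaned.endswith(p) for p in TURN_MARKER_PHRASES)
        return phrase_hit and trailing_silence_s >= 1.5

    if ends_with_turn_marker("So what do you think?", trailing_silence_s=2.0):
        # Present the phrase to the AU immediately, with underlined spaces
        # reserving room for the lagging CA text (see 735/737 in FIG. 22).
        pass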

It has been recognized that some types of calls can almost always be accurately handled by an AVR engine. For instance, auto-attendant type calls can typically be transcribed accurately via an AVR. For this reason, in at least some embodiments, it is envisioned that a system processor at the AU device or at the relay may be able to determine a call type (e.g., auto-attendant or not, or some other call type routinely accurately handled by an AVR engine) and automatically route calls within the overall system to the best and most efficient/effective option for text generation. Thus, for example, in a case where an AU device manages access to an AVR operated by a third party and accessible via an internet link, when an AU places a call that is received by an auto-attendant system, the AU device may automatically recognize the answering system as an auto-attendant type and, instead of transmitting the auto-attendant voice signal to a relay for CA transcription, may transmit the auto-attendant voice signal to the third party AVR engine for text generation.

In this example, if the call type changes mid-stream during its duration, the AU device may also transmit the received voice signal to a CA for captioning if appropriate. For instance, if an interactive voice recognition auto-attendant system eventually routes the AU's call to a live person (e.g., a service representative for a company), once the live person answers the call, the AU device processor may recognize the person's voice as a non-auto-attendant signal and route that signal to a CA for captioning as well as to the AVR for voice model training. In these cases, the AVR engine may be specially tuned to transcribe auto-attendant voice signals to text and, when a live HU gets on the line, would immediately start training a voice model for that HU's voice signal.

In cases or at times when HU voice signals are transcribed automatically to text via an AVR engine and a CA is only correcting AVR generated text, the relay may include a synchronizing function or capability so that, as a CA listens to an HU's voice signal during an error correction process, the associated text from the AVR is presented generally synchronously to the CA with the HU voice signal. For instance, in some cases an AVR transcribed word may be visually presented via a CA display 50 at substantially the same instant at which the word is broadcast to the CA to hear. As another instance, the AVR transcribed word may be presented one, two, or more seconds prior to broadcast of that word to the CA.

In still other cases, the AVR generated text may be presented for correction via a CA display 50 immediately upon generation and, as the CA controls broadcast speed of the HU voice signal for correction purposes, the word or phrase instantaneously audibly broadcast may be highlighted or visually distinguished in some fashion. To this end, see FIG. 23 where automated AVR generated text is shown at 748 and where a word instantaneously audibly broadcast to a CA (see 752) is simultaneously highlighted at 750. Here, as the words are broadcast via CA headset 54, the text representations of the words are highlighted or otherwise visually distinguished to help the error correcting CA follow along.

In at least some cases an error correcting CA will be able to skip back and forth within the HU voice signal to control broadcast of the HU voice signal to the CA. For instance, as described above, a CA may have a foot pedal useable to skip back in a buffered HU voice recording 5, 10, etc., seconds to replay an HU voice signal recording. Here, when the recording skips back, the highlighted text in representation 748 would likewise skip back to be synchronized with the broadcast words. To this end, see FIG. 25 where, in at least some cases, a foot pedal activation may cause the recording to skip back to the word "pizza" which is then broadcast as at 764 and highlighted in text 748 as shown at 762. In other cases, the CA may simply single tap or otherwise select any word presented on display 50 to skip the voice signal play back and highlighted text to that word. For instance, in FIG. 25 icon 766 represents a single tap which causes the word "pizza" to be highlighted and substantially simultaneously broadcast. Other word selecting gestures (e.g., a mouse control click, etc.) are contemplated.
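
A simplified, non-limiting sketch of keeping highlighted text synchronized with audio playback, as in FIGS. 23 and 25, follows. It assumes the AVR engine supplies a start time for each word; the data structure and method names are illustrative only.

    from dataclasses import dataclass

    @dataclass
    class TimedWord:
        text: str
        start_s: float  # offset of the word in the buffered HU recording

    class SyncedPlayback:
        def __init__(self, words: list[TimedWord]):
            self.words = words
            self.position_s = 0.0

        def current_word_index(self) -> int:
            """Index of the word to highlight for the current playback position."""
            idx = 0
            for i, w in enumerate(self.words):
                if w.start_s <= self.position_s:
                    idx = i
            return idx

        def skip_back(self, seconds: float) -> None:
            # Foot pedal: jump the recording back; highlighting follows.
            self.position_s = max(0.0, self.position_s - seconds)

        def jump_to_word(self, index: int) -> None:
            # Single tap on a word: move playback and highlighting to that word.
            self.position_s = self.words[index].start_s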

In some embodiments, when a CA selects a text word to correct, the voice signal replay may automatically skip to some word in the voice buffer relative to the selected word and may halt voice signal replay automatically until the correction has been completed. For instance, a double tap on the word "pals" in FIG. 23 may cause that word to be highlighted for correction and may automatically cause the point in the HU voice replay to move backward to a location a few words prior to the selected word "pals." To this end, see in FIG. 25 that the word "Pete's" that is still highlighted as being corrected (e.g., the CA has not confirmed a complete correction) has been typed in to replace the word "pals" and the word "pizza" that precedes the word "Pete's" has been highlighted to indicate where the HU voice signal broadcast will again commence after the correction at 760 has been completed. While backward replay skipping has been described, forward skipping is also contemplated.

In some cases, when a CA selects a word in presented text for correction or at least to be considered for correction, the system may skip to a location a few words prior to the selected word and may re-present the HU voice signal starting at that point and ending a few words after that point to give the CA context in which to hear the word to be corrected. Thereafter, the system may automatically move back to the subsequent point in the HU voice signal at which the CA was when the word to be corrected was selected. For instance, again, in FIG. 25, assume that the HU voice broadcast to a CA is at the word "catch" 761 when the CA selects the word "Pete's" 760 for correction. In this case, the CA's interface may skip back in the HU voice signal to the word "pizza" at 762 and re-broadcast the phrase parts from the word "pizza" to the word "want" 763 to provide immediate context to the CA. After broadcasting the word "want", the interface would skip back to the word "catch" 761 and continue broadcasting the HU voice signal from that point on.

In at least some embodiments where an AVR engine generates automated text and a CA is simply correcting that text prior to transmission to an AU, the AVR engine may assign a confidence factor to each word generated that indicates how likely it is that the word is accurate. Here, in at least some cases, the relay server may highlight any text on the correcting CA's display screen that has a confidence factor lower than some threshold level to call that text to the attention of the CA for special consideration. To this end, see again FIG. 23 where various words (e.g., 777, 779, 781) are specially highlighted in the automatically generated AVR text to indicate a low confidence factor.

While AU voice signals are not presented to a CA in most cases for privacy reasons, it is believed that in at least some cases a CA may prefer to have some type of indication of when an AU is speaking to help the CA understand how a communication is progressing. To this end, in at least some embodiments an AU device may sense an AU voice signal and at least generate some information about when the AU is speaking. The speaking information, without word content, may then be transmitted in real time to the CA at the relay and used to present an indication that the AU is speaking on the CA screen. For instance, see again FIG. 23 where lines 783 are presented on display 50 to indicate that an AU is speaking. As shown, lines 783 are presented on a right side of the display screen to distinguish the AU's speaking activity from the text and other visual representations associated with the HU's voice signal. As another instance, when the AU speaks, a text notice 797 or some graphical indicator (e.g., a talking head) may be presented on the CA display 50 to indicate current speaking by an AU. While not shown, it is contemplated that some type of non-content AU speaking indication like 783 may also be presented to an AU via the AU's device to help the AU understand how the communication is progressing.

It has been recognized that some third party AVR systems available via the internet or the like tend to be extremely accurate for short voice signal durations (e.g., 15-30 seconds), after which accuracy becomes less reliable. To deal with AVR accuracy degradation during an ongoing call, in at least some cases where a third party AVR system is employed to generate automated text, the system processor (e.g., at the relay, in the AU device or in the HU device) may be programmed to generate a series of automatic text transcription requests where each request only transmits a short sub-set of a complete HU voice signal. For instance, a first AVR request may be limited to a first 15 seconds of HU voice signal, a second AVR request may be limited to a next 15 seconds of HU voice signal, a third AVR request may be limited to a third 15 seconds of HU voice signal, and so on. Here, each request would present the associated HU signal to the AVR system immediately and continuously as the HU voice signal is received and transcribed text would be received back from the AVR system during the 15 second period. As the text is received back from the AVR system, the text would be cobbled together to provide a complete and relatively accurate transcript of the HU voice signal.

While the HU voice signal may be divided into consecutive periods in some cases, in other cases it is contemplated that the HU voice signal slices or sub-periods sent to the AVR system may overlap at least somewhat to ensure all words uttered by an HU are transcribed and to avoid a case where words in the HU voice signal are split among periods. For instance, voice signal periods may be 30 seconds long and each may overlap a preceding period by 10 seconds and a following period by 10 seconds to avoid split words. In addition to avoiding a split word problem, overlapping HU voice signal periods presented to an AVR system allow the system to use the context represented by surrounding words to better (e.g., contextually) convert HU voiced words to text. Thus, a word at the end of a first 20 second voice signal period will be near the front end of the overlapping portion of a next voice signal period and therefore, typically, will have contextual words prior to and following the word in the next voice signal period so that a more accurate contextually considered text representation can be generated.
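
The sketch below illustrates slicing a growing HU audio buffer into overlapping sub-periods for a series of short third-party AVR requests, per the two preceding paragraphs. The 30 second window and 10 second overlap follow the example above; the function name and the use of time offsets rather than audio samples are simplifying assumptions.

    def make_chunks(total_seconds: float, window_s: float = 30.0,
                    overlap_s: float = 10.0) -> list[tuple[float, float]]:
        """Return (start, end) sub-periods covering the signal with overlap."""
        chunks, start = [], 0.0
        step = window_s - overlap_s
        while start < total_seconds:
            chunks.append((start, min(start + window_s, total_seconds)))
            start += step
        return chunks

    print(make_chunks(70.0))
    # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0), (60.0, 70.0)]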

In some cases, a system processor may employ two, three or more independent or differently tuned AVR systems to automatically generate automated text and the processor may then compare the text results and formulate a single best transcript representation in some fashion. For instance, once text is generated by each engine, the processor may poll for the most common words or phrases and then select the most common as the text to provide to an AU, to a CA, to a voice modeling engine, etc.

In most cases automated text (e.g., AVR generated text) will be generated much faster than CA generated text, or at least consistently much faster. It has been recognized that in at least some cases an assisted user will prefer even uncorrected automated text to CA corrected text where the automated text is generated more rapidly and is therefore more in sync with an audio broadcast HU voice signal. For this reason, in at least some cases, a different and more complex voice-to-text triage process may be implemented. For instance, when an AU-HU call commences and the AU requires text initially, automated AVR generated text may initially be provided to the AU. If a good HU voice model exists for the HU, the automated text may be provided without CA correction, at least initially. If the AU, a system processor, or an HU determines that the automated text includes too many errors or if some other operating characteristic (e.g., line noise) that may affect text transcription accuracy is sensed, a next level of the triage process may link an error correcting CA to the call and the AVR text may be presented in essentially real time to the CA via display 50 simultaneously with presentation to the AU via display 18.

Here, as the CA corrects the automated text, corrections are automatically sent to the AU device and are indicated via display 18. Here, the corrections may be in-line (e.g., erroneous text replaced), shown above errors, shown after errors, visually distinguished via highlighting or the like, etc. Here, if too many errors continue to persist from the AU's perspective, the AU may select an AU device button (e.g., see 68 again in FIG. 1) to request full CA transcription. Similarly, if an error correcting CA perceives that the AVR engine is generating too many errors, the error correcting CA may perform some action to initiate full CA transcription and correction. Similarly, a relay processor or even an AU device processor may detect that an error correcting CA is having to correct too many errors in the AVR generated text and may automatically initiate full CA transcription and correction.

In any case where a CA takes over for an AVR engine to generate text, the AVR engine may still operate on the HU voice signal to generate text and use that text and CA generated text, including corrections, to refine a voice model for the HU. At some point, once the voice model accuracy as tested against the CA generated text reaches some threshold level (e.g., 95% accuracy), the system may again, automatically or at the command of the transcribing CA or the AU, revert back to the CA corrected AVR text and may cut out the transcribing CA to reduce costs. Here, if the AVR engine eventually reaches a second higher accuracy threshold (e.g., 98% accuracy), the system may again, automatically or at the command of an error correcting CA or an AU, revert back to the uncorrected AVR text to further reduce costs.

In at least some cases it is contemplated that an AU device may allow an AU to set a personal preference between text transcription accuracy and text speed. For instance, a first AU may have fairly good hearing and therefore may only rely on a text transcript periodically to identify a word uttered by an HU, while a second AU has extremely bad hearing and effectively reads every word presented on an AU device display. Here, the first AU may prefer text speed at the expense of some accuracy while the second AU may require accuracy even when speed of text presentation or correction is reduced. An exemplary AU device tool is shown as an accuracy/speed scale 770 in FIG. 18 where an accuracy/speed selection arrow 772 indicates a currently selected operating characteristic. Here, when arrow 772 is moved to the left, operating parameters like correction time, AVR operation, etc., are adjusted to increase accuracy at the expense of speed, and moving arrow 772 to the right on scale 770 increases speed of text generation at the expense of accuracy.

In at least some embodiments, when text is presented to an error correcting CA via a CA display 50, the text may be presented at least slightly prior to (e.g., ¼ to 2 seconds before) broadcast of an associated HU voice signal. In this regard, it has been recognized that many CAs prefer to see text prior to hearing a related audio signal and link the two optimally in their minds when text precedes audio. In other cases specific CAs may prefer simultaneous text and audio and still others may prefer audio before text. In at least some cases it is contemplated that a CA workstation may allow a CA to set text-audio sync preferences. To this end, see the exemplary text-audio sync scale 765 in FIG. 25 that includes a sync selection arrow 767 that can be moved along the scale to change text-audio order as well as the delay or lag between the two.

In at least some embodiments an on-screen tool akin to scale 765 and arrow 767 may be provided on an AU device display 18 to adjust HU voice signal broadcast and text presentation timing to meet an AU's preferences.

It has been recognized that some AUs can hear voice signals with a specific characteristic set better than other voice signals. For instance, one AU may be able to hear low pitch, traditionally male voices better than high pitch, traditionally female voice signals. In some embodiments an AU may perform a commissioning procedure whereby the AU's capability to accurately hear voice signals having different characteristics is tested and results of those capabilities may be stored in a system database. The hearing capability results may then be used to adjust or modify the way text captioning is accomplished. For instance, in the above case where an AU hears low pitch voices well but not high pitch voices, if a low pitch HU voice is detected when a call commences, the system may use the AVR function more rapidly than in the case of a high pitched voice signal. Voice characteristics other than pitch may be used to adjust text transcription and AVR transition protocols in similar ways.

In at least some cases where an HU device like a smart phone, tablet, computing device, laptop, smart watch, etc., has the ability to store data or to access data via the internet, a WIFI system or otherwise that is stored on a local or remote (e.g., cloud) server, it is contemplated that every HU device, or at least a subset used by specific HUs, may store an HU voice model for an associated HU to be used by a captioning application or by any software application run by the HU device. Here, the HU model may be trained by one or more applications run on the HU device or by some other application like an AVR system associated with one of the captioning systems described herein that is run by an AU device, the relay server, or some third party server or processor. Here, for example, in one instance, an HU's voice model stored on an HU device may be used to drive a voice-to-text search engine input tool to provide text for an internet search independent of the captioning system. The multi-use and perhaps multi-application trained HU voice model may also be used by a captioning AVR system during an AU-HU call. Here, the voice model may be used by an AVR application run on the HU device, run on the AU device, run by the relay server or run by a third party server.

In cases where an HU voice model is accessible to an AVR engine independent of an HU device, when an AU device is used to place a call to an HU device, an HU model associated with the number called may be automatically prepared for generating captions even prior to connection to the HU device. Where a phone or other identifying number associated with an HU device can be identified prior to an AU answering a call from the HU device, again, an HU voice model associated with the HU device may be accessed and readied by the captioning system for use prior to the answering action to expedite AVR text generation. Most people use one or a small number of phrases when answering an incoming phone call. Where an HU voice model is loaded prior to an HU answering a call, the AVR engine can be poised to detect one of the small number of greeting phrases routinely used to answer calls and to compare the HU's voice signal to the model to confirm that the voice model is for the specific HU that answers the call. If the HU's salutation upon answering the call does not match the voice model, the system may automatically link to a CA to start a CA controlled captioning process.

While a captioning system must provide accurate text corresponding to an HU voice signal for an AU to view when needed, typical relay systems for deaf and hard of hearing persons would not provide a transcription of an AU's voice signal. Here, generally, the thinking has been that an AU knows what she says in a voice signal and an HU hears that signal, and therefore text versions of the AU's voice were not necessary. This, coupled with the fact that AU captioning would have substantially increased the transcription burden on CAs (e.g., would have required CA revoicing or typing and correction of more voice signal (e.g., the AU voice signal)), meant that AU voice signal transcription simply was not supported. Another reason AU voice transcription was not supported was that at least some AUs, for privacy reasons, do not want both sides of conversations with HUs being listened to by CAs.

In at least some embodiments, it is contemplated that the AU side of a conversation with an HU may be transcribed to text automatically via an AVR engine and presented to the AU via a device display 18 while the HU side of the conversation is transcribed to text in the most optimal way given transcription triage rules or algorithms as described above. Here, the AU voice captions and AU voice signal would never be presented to a CA. Here, while AU voice signal text may not be necessary in some cases, in others it is contemplated that many AUs may prefer that text of their voice signals be presented, to be referred back to or simply as an indication of how the conversation is progressing. Seeing both sides of a conversation helps a viewer follow the progress more naturally. Here, while the AVR generated AU text may not always be extremely accurate, accuracy in the AU text is less important because, again, the AU knows what she said.

Where an AVR engine automatically generates AU text, the AVR engine may be run by any of the system processors or devices described herein. In particularly advantageous systems the AVR engine will be run by the AU device 12, where the software that transcribes the AU voice to text is trained to the voice of the AU and therefore is extremely accurate because of the personalized training.

Thus, referring again to FIG. 1, for instance, in at least some embodiments, when an AU-HU call commences, the AU voice signal may be transcribed to text by AU device 12 and presented as shown at 822 in FIG. 26 without providing the AU voice signal to relay 16. The HU voice signal, in addition to being audibly broadcast via AU device 12, may be transmitted in some fashion to relay 16 for conversion to text when some type of CA assistance is required. Accurate HU text is presented on display 18 at 820. Thus, the AU gets to see both AU text, albeit with some errors, and highly accurate HU text. Referring again to FIG. 24, in at least some cases, AU and HU text may also be presented to an HU via an HU device (e.g., a smart phone) in a fashion similar to that shown in FIG. 26.

Referring still to FIG. 26, where both HU and AU text are generated and presented to an AU, the HU and AU text may be presented in staggered columns as shown, along with an indication of how each text representation was generated (e.g., see the titles at the top of each column in FIG. 26).

In at least some cases it is contemplated that an AU may, at times, not even want the HU side of a conversation to be heard by a CA for privacy reasons. Here, in at least some cases, it is contemplated that an AU device may provide a button or other type of selectable activator to indicate that total privacy is required and then to re-establish relay or CA captioning and/or correction again once privacy is no longer required. To this end, see the "Complete Privacy" button or virtual icon 826 shown on the AU device display 18 in FIG. 26. Here, it is contemplated that, while an AU-HU conversation is progressing and a CA generates/corrects text 820 for an HU's voice signal and an AVR generates AU text 822, if the AU wants complete privacy but still wants HU text, the AU would select icon 826. Once icon 826 is selected, the HU voice signal would no longer be broadcast to the CA and instead an AVR engine would transcribe the HU voice signal to automated text to be presented via display 18. Icon 826 in FIG. 26 would be changed to "CA Caption" or something to that effect to allow the AU to again start full CA assistance when privacy is less of a concern.

In addition to a voice-to-text lag exceeding a maximum lag time, there may be other triggers for using AVR engine generated text to catch an AU up to an HU voice signal. For instance, in at least some cases an AU device may monitor for an utterance from an AU using the device and may automatically fill in AVR engine generated text corresponding to an HU voice signal when any AU utterance is identified. Here, for example, where CA transcription is 30 seconds behind an HU voice signal, if an AU speaks, it may be assumed that the AU has been listening to the HU voice signal and is responding to the broadcast HU voice signal in real time. Because the AU responds to the up-to-date HU voice signal, there is no need for an accurate text transcription of prior HU voice phrases and therefore automated text may be used to automatically catch up. In this case, the CA's transcription task would simply be moved up in time to the current real time HU voice signal automatically and the CA would not have to consider the intervening 30 seconds of HU voice for transcription or even correction.

As another example, when an AU device or other system device recognizes a turn marker in an HU voice signal, all AVR generated text that is associated with a lag time may be filled in immediately and automatically.

As still one other instance, an AU device or other device may monitor AU utterances for some specific word or phrase intended to trigger an update of text associated with a lag time. For instance, the AU device may monitor for the word "Update" and, when it is identified, may fill in the lag time with automated text. Here, in at least some cases, the AU device may be programmed to cancel the catch-up word "Update" from the AU voice signal sent to the HU device. Thus, here, the AU utterance "Update" would have the effect of causing AVR text to fill in a lag time without being transmitted to the HU device. Other commands may be recognized and automatically removed from the AU voice signal.
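
As a non-limiting illustration, the catch-up command handling described above might be reduced to a small filter on recognized AU utterances. The command word, the fill_lag_with_avr_text callback, and the send_to_hu callback in the sketch below are illustrative assumptions about how one implementation could be wired, not requirements of the disclosure.

# Illustrative sketch: watch AU utterances for a catch-up command and
# suppress that command from the voice/text stream forwarded to the HU.
CATCH_UP_COMMAND = "update"  # illustrative trigger word

def handle_au_utterance(utterance_text, fill_lag_with_avr_text, send_to_hu):
    words = utterance_text.split()
    if any(w.lower().strip(".,!?") == CATCH_UP_COMMAND for w in words):
        # Fill the current lag with automated (AVR) text immediately.
        fill_lag_with_avr_text()
        # Remove the command word before anything is passed on to the HU side.
        words = [w for w in words if w.lower().strip(".,!?") != CATCH_UP_COMMAND]
    if words:
        send_to_hu(" ".join(words))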

Thus, it should be appreciated that various embodiments of a semi-automated automatic voice recognition or text transcription system to aid hearing impaired persons when communicating with HUs have been described. In each system there are at least three entities and at least three devices, and in some cases there may be a fourth entity and an associated fourth device. In each system there is at least one HU and associated device, one AU and associated device and one relay and associated device or sub-system, while in some cases there may also be a third party provider (e.g., a fourth party) of AVR services operating one or more servers that run AVR software. The HU device, at a minimum, enables an HU to annunciate words that are transmitted to an AU device and receives an AU voice signal and broadcasts that signal audibly for the HU to hear.

The AU device, at a minimum, enables an AU to annunciate words that are transmitted to an HU device, receives an HU voice signal and broadcasts that signal audibly for the AU to attempt to hear, receives or generates transcribed text corresponding to an HU voice signal and displays the transcribed text to an AU on a display to view.

The relay, at a minimum, at times, receives the HU voice signal and generates at least corrected text that may be transmitted to another system device.

In some cases where there is no fourth party AVR system, any of the other functions/processes described above may be performed by any of the HU device, AU device and relay server. For instance, the HU device in some cases may store an HU voice model and/or voice characteristics model, an AVR application and a software program for managing which text, AVR or CA generated, is used to drive an AU device. Here, the HU device may link directly with each of the AU device and relay, and may operate as an intermediary therebetween.

As another instance, HU models, AVR software and caption control applications may be stored and used by the AU device processor or, alternatively, by the relay server. In still other instances different system components or devices may perform different aspects of a functioning system. For instance, an HU device may store an HU voice model which may be provided to an AU device automatically at the beginning of a call and the AU device may transmit the HU voice model along with a received HU voice signal to a relay that uses the model to tune an AVR engine to generate automated text as well as provides the HU voice signal to a first CA for revoicing to generate CA text and a second CA for correcting the CA text. Here, the relay may transmit both transcribed texts (e.g., automated and CA generated) to the AU device and the AU device may then select one of the received texts to present via the AU device screen. Here, CA captioning and correction and transmission of CA text to the AU device may be halted in total or in part at any time by the relay or, in some cases, by the AU device, based on various parameters or commands received from any parties (e.g., AU, HU, CA) linked to the communication.

In cases where a fourth party to the system operates an AVR engine in the cloud or otherwise, at a minimum, the AVR engine receives an HU voice signal at least some of the time and generates automated text which may or may not be used at times to drive an AU device display.

In some cases it is contemplated that AVR engine text (e.g., automated text) may be presented to an HU while CA generated text is presented to an AU, and a most recent word presented to an AU may be indicated in the text on the HU device so that the HU has a good sense of how far behind an AU is in following the HU's voice signal. To this end, see FIG. 27 that shows an exemplary HU smart phone device 800 including a display 801 where text corresponding to an HU voice signal is presented for the HU to view at 848. The text 848 includes text already presented to an AU prior to and including the word "after" that is shown highlighted 850, as well as AVR engine generated text subsequent to the highlight 850 that, in at least the illustrated embodiment, may not have been presented to the AU at the illustrated time. Here, an HU viewing display 801 can see where the AU is in receiving text corresponding to the HU voice signal. The HU may use the information presented as a coaching tool to help the HU regulate the speed at which the HU converses.

To be clear, where an HU device is a smart phone or some other type of device that can run an application program to participate in a captioning service, many different linking arrangements between the AU, HU and a relay are contemplated. For instance, in some cases the AU and HU may be directly linked and there may be a second link or line from the AU to the relay for voice and data transmission when necessary between those two entities. As another instance, when an HU and AU are linked directly and relay services are required after the initial link, the AU device may cause the HU device to link directly to the relay and the relay may then link to the AU device so that the relay is located between the AU and HU devices and all communications pass through the relay. In still another instance, an HU device may link to the relay and the relay to the AU device and the AU device to the HU device so that any communications, voice or data, between two of the three entities are direct without having to pass through the other entity (e.g., HU and AU voice signals would be directly between HU and AU devices, the HU voice signal would be direct from the HU device to the relay, and transcribed text associated with the HU voice would be directly passed from the relay to the AU device to be displayed to the AU). Here, any text generated at the relay to be presented via the HU device would be transmitted directly from the relay to the HU device and any text generated by either one of the AU or HU devices (e.g., via an AVR engine) would be directly transmitted to the receiving device. Thus, an HU device or captioning application run thereby may maintain a direct dial number or address for the relay and be able to link up to the relay automatically when CA or other relay services are required.

Referring now to FIG. 28, a schematic is shown of an exemplary semi-automated captioning system that is consistent with at least some aspects of the present disclosure. The system enables an HU using device 14 to communicate with an AU using AU device 12 where the AU receives text and HU voice signals via the AU device 12. Each of the HU and the AU link into a gateway server or other computing device 900 that is linked via a network of some type to a relay. HU voice signals are fed through a noise reducing audio optimizer to a 3 pole or path AVR switch device 904 that is controlled by an adaptive AVR switch controller 932 to select one of first, second and third text generating processes associated with switch output leads 940, 942 and 944, respectively. The first text generating process is an automated AVR text process wherein an AVR engine generates text without any input (e.g., data entry, correction, etc.) from any CA. The second text generating process is a process wherein a CA 908 revoices an HU voice signal or types to generate text corresponding to an HU voice signal and then corrects that text. The third text generating process is one wherein the AVR engine generates automated text and a correcting CA 912 makes corrections to the automated text. In the second process, the AVR engine operates in parallel with the CA, generating automated text alongside the CA generated and corrected text.

Referring still to FIG. 28, with switch 904 connected to output lead 940, the HU voice signal is only presented to AVR engine 906, which generates automated text corresponding to the HU voice that is then provided to a voice to text synchronizer 910. Here, synchronizer 910 simply passes the raw AVR text on through a correctable text window 916 to the AU device 12.

Referring again to FIG. 28, with switch 904 connected to output lead 942, the HU voice signal, in addition to being linked to the AVR engine, is presented to CA 908 for generating and correcting text via traditional CA voice recognition 920 and manual correction tools 924 via correction window 922. Here, corrected text is provided to the AU device 12 and is also provided to a text comparison unit or module 930. Raw text from the AVR engine 906 is also presented to comparison unit 930. Comparison unit 930 compares the two text streams received and calculates an AVR error rate which is output to switch control 932. Here, where the AVR error rate is low (e.g., below some threshold), control 932 may operate switch device 904 to cut the text generating CA 908 out of the captioning process.

Referring still to FIG. 28, with switch 904 connected to output lead 944, the HU voice signal, in addition to being linked to the AVR engine, is fed through synchronizer 910 which delays the HU voice signal so that the HU voice signal lags the raw AVR text by a short period (e.g., 2 seconds). The delayed HU voice signal is provided to a CA 912 charged with correcting AVR text generated by engine 906. The CA 912 uses a keyboard or the like 914 to correct any perceived errors in the raw AVR text presented in window 916. The corrected text is provided to the AU device 12 and is also provided to the text comparison unit 930 for comparison to the raw AVR text. Again, comparison unit 930 generates an AVR error rate which is used by control 932 to operate switch device 904. The manual corrections by CA 912 are provided to a CA error tracking unit 918 which counts the number of errors corrected by the CA and compares that number to the total number of words generated by the AVR engine 906 to calculate a CA correction rate for the AVR generated raw text. The correction rate is provided to control 932 which uses that rate to control switch device 904.

Thus, in operation, when an HU-AU call first requires captioning, in at least some cases switch device 904 will be linked to output lead 942 so that full CA transcription and correction occurs in parallel with the AVR engine generating raw AVR text for the HU voice signal. Here, as described above, the AVR engine may be programmed to compare the raw AVR text and the CA generated text and to train to the HU's voice signal so that, over a relatively short period, the error rate generated by comparison unit 930 drops. Eventually, once the error rate drops below some rate threshold, control 932 controls device 904 to link to output lead 944 so that CA 908 is taken out of the captioning path and CA 912 is added. CA 912 receives the raw AVR text and corrects that text, which is sent on to the AU device 12. As the CA corrects text, the AVR engine continues to train to the HU voice using the corrected errors. Eventually, the AVR accuracy should improve to the point where the correction rate calculated by tracking unit 918 is below some threshold. Once the correction rate is below the threshold, control 932 may control switch 904 to link to output lead 940 to take the CA 912 out of the captioning loop, which causes the relatively accurate raw AVR text to be fed through to the AU device 12. As described above, in at least some cases the AU and perhaps a CA or the HU may be able to manually switch between captioning processes to meet preferences or to address perceived captioning problems.
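
By way of a non-limiting illustration, the triage behavior of switch device 904 and controller 932 may be summarized as a small decision routine keyed to the two rates described above. The numeric thresholds below are illustrative assumptions only; FIG. 28 does not prescribe specific values.

# Illustrative sketch of the adaptive switch logic of FIG. 28:
# lead 942 = full CA captioning (AVR trains in parallel),
# lead 944 = AVR text corrected by a CA,
# lead 940 = raw AVR text only.  Threshold values are assumed.
AVR_ERROR_THRESHOLD = 0.05       # from comparison unit 930 (AVR vs CA text)
CA_CORRECTION_THRESHOLD = 0.02   # from tracking unit 918 (corrections per AVR word)

def select_output_lead(current_lead, avr_error_rate, ca_correction_rate):
    if current_lead == 942 and avr_error_rate is not None and avr_error_rate < AVR_ERROR_THRESHOLD:
        return 944   # AVR accurate enough: drop the transcribing CA, keep a correcting CA
    if current_lead == 944 and ca_correction_rate is not None and ca_correction_rate < CA_CORRECTION_THRESHOLD:
        return 940   # corrections are rare: pass raw AVR text straight to the AU device
    return current_lead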

As described above, it has been recognized that at least some AVR engines are more accurate and more resilient during the first 30+/− seconds of performing voice to text transcription. If an HU takes a speaking turn that is longer than 30 seconds the engine has a tendency to freeze or lag. To deal with this issue, in at least some embodiments, all of an HU's speech or voice signal may be fed into an audio buffer and a system processor may examine the HU voice signal to identify any silent periods that exceed some threshold duration (e.g., 2 seconds). Here, a silent period would be detected whenever the HU voice signal audio is out of a range associated with a typical human voice. When a silent period is identified, in at least some cases the AVR engine is restarted and a new AVR session is created. Here, because the process uses an audio buffer, no portion of the HU's speech or voice signal is lost and the system can simply restart the AVR engine after the identified silent period and continue the captioning process after removing the silent period.

Because the AVR engine is restarted whenever a silent period of at least a threshold duration occurs, the system can be designed to have several advantageous features. First, the system can implement a dynamic and configurable range of silence or gap threshold. For instance, in some cases, the system processor monitoring for a silent period of a certain threshold duration can initially seek a period that exceeds some optimal relatively long length and can reduce the length of the threshold duration as the AVR captioning process nears a maximum period prior to restarting the engine. Thus, for instance, where a maximum AVR engine captioning period is 30 seconds, initially the silent period threshold duration may be 3 seconds. However, after an initial 20 seconds of captioning by an engine, the duration may be reduced to 1.5 seconds. Similarly, after 25 seconds of engine captioning, the threshold duration may be reduced further to one half of a second.
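
A minimal sketch of this shrinking gap threshold is shown below; the breakpoints (3.0, 1.5, and 0.5 seconds against an assumed 30-second session limit) mirror the example just given, and the restart callback is an assumed hook rather than part of any particular AVR engine.

# Illustrative sketch: silence-gap threshold that relaxes as an AVR
# session nears its maximum useful duration (about 30 seconds above).
def gap_threshold(seconds_into_session):
    """Return the minimum silent-period length (seconds) that triggers a restart."""
    if seconds_into_session < 20:
        return 3.0
    if seconds_into_session < 25:
        return 1.5
    return 0.5

def maybe_restart(seconds_into_session, current_gap_length, restart_avr_session):
    if current_gap_length >= gap_threshold(seconds_into_session):
        restart_avr_session()   # buffered audio resumes captioning after the gap
        return True
    return False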

As another instance, because the system uses an audio buffer in this case, the system can "manufacture" a gap or silent period in which to restart an AVR engine, holding an HU's voice signal in the audio buffer until the AVR engine starts captioning anew. While the manufactured silent period is not as desirable as identifying a natural gap or silent period as described above, the manufactured gap is a viable option if necessary so that the AVR engine can be restarted without loss of HU voice signal.

In some cases it is contemplated that a hybrid silent period approach may be implemented. Here, for instance, a system processor may monitor for a silent period that exceeds 3 seconds in which to restart an AVR engine. If the processor does not identify a suitable 3-plus second period for restarting the engine within 25 seconds, the processor may wait until the end of any word and manufacture a 3 second period in which to restart the engine.

Where a silent period longer than the threshold duration occurs and the AVR engine is restarted, if the engine is ready for captioning prior to the end of the threshold duration, the processor can take out the end of the silent period and begin feeding the HU voice signal to the AVR engine prior to the end of the threshold period. In this way, the processor can effectively eliminate most of the silent period so that captioning proceeds quickly.

Restarting an AVR engine at various points within an HU voice signal has the additional benefit of making all hypothesis words (e.g., initially identified words prior to contextual correction based on subsequent words) firm. Doing so allows a CA correcting the text to make corrections or any other manipulations deemed appropriate for an AU immediately, without having to wait for automated contextual corrections.

In still other cases other hybrid systems are contemplated where a processor examines an HU voice signal for suitably long silent periods in which to restart an AVR engine and, where no such period occurs by a certain point in a captioning process, the processor commences another AVR engine captioning process which overlaps the first process so that no HU voice signal is lost. Here, the processor would work out which captioned words are ultimately used as final AVR output during the overlapping periods to avoid duplicative or repeated text.

One other feature that may be implemented in some embodiments of this disclosure is referred to as a Return On Audio detector (ROA-Detector) feature. In this regard, a system processor receiving an HU voice signal ascertains whether or not the signal includes audio in a range that is typical for human speech during an HU turn and generates a duration of speech value equal to the number of seconds of speech received. Thus, for instance, in a ten second period corresponding to an HU voice signal turn, there may be 3 seconds of silence during which audio is not in the range of typical human speech and therefore the duration of speech value would be 7 seconds. In addition, the processor detects the quantity of captions being generated by an AVR engine. The processor automatically compares the quantity of captions from the AVR with the duration of speech value to ascertain if there is a problem with the AVR engine. Thus, for instance, if the quantity of AVR generated captions is substantially less than would be expected given the duration of speech value, a potential AVR problem may be identified. Where an AVR problem is likely, the likely problem may be used by the processor to trigger a restart of the AVR engine to generate a better result. As an alternative, where an AVR problem is likely, the problem may trigger initiation of a whole new AVR session. As still one other alternative, a likely AVR problem may trigger a process to bring a CA on line immediately or more quickly than would otherwise be the case.
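
The Return On Audio check reduces to comparing expected caption volume against observed caption volume, as the following non-limiting sketch illustrates. The assumed words-per-second rate and the trigger ratio are placeholders introduced for illustration only.

# Illustrative sketch of a Return On Audio (ROA) check: flag a likely AVR
# problem when the caption count is far below what the speech duration implies.
EXPECTED_WORDS_PER_SECOND = 2.5   # assumed typical conversational rate
MIN_RETURN_RATIO = 0.5            # assumed ratio below which the AVR is suspect

def avr_looks_stalled(speech_seconds, caption_word_count):
    if speech_seconds <= 0:
        return False
    expected_words = speech_seconds * EXPECTED_WORDS_PER_SECOND
    return caption_word_count < MIN_RETURN_RATIO * expected_words

# Example: 7 seconds of detected speech but only 4 captioned words.
if avr_looks_stalled(7.0, 4):
    pass  # restart the AVR engine, open a new session, or bring a CA on line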

In still other cases, when an AVR error is detected as indicated above, the ROA detector may retrieve the audio (i.e., the HU voice signal) that was originally sent to the AVR from a rolling buffer and replay/resend the audio to the AVR engine. This replayed audio would be sent through a separate session simultaneously with any new sessions that are sending ongoing audio to the AVR. Here, the captions corresponding to the replayed audio would be sent to the AU device and inserted into a correct sequential slot in the captions presented to the AU. In addition, here, the ROA detector would monitor the text that comes back from the AVR and compare that text to the text retrieved during the prior session, modifying the captions to remove redundancies. Another option would be for the ROA to simply deliver a message to the AU device indicating that there was an error and that a segment of audio was not properly captioned. Here, the AU device would present the likely erroneous captions in some way that indicates a likely error (e.g., perhaps visually distinguished by a yellow highlight or the like).

In some cases it is contemplated that a phone user may want to have just in time (JIT) captions on their phone or other communication device (e.g., a tablet) during a call with an HU for some reason. For instance, when a smart phone user wants to remove a smart phone from her ear for a short period, the user may want to have text corresponding to an HU's voice presented during that period. Here, it is contemplated that a virtual "Text" or "Caption" button may be presented on the smart phone display screen or a mechanical button may be provided on the device which, when selected, causes an AVR to generate text for a preset period of time (e.g., 10 seconds) or until turned off by the device user. Here, the AVR may be on the smart phone device itself, may be at a relay or at some other device (e.g., the HU's device).

While HU voice profiles may be developed and stored for any HU calling an AU, in some embodiments profiles may only be stored for a small set of HUs, such as, for instance, a set of favorites or contacts of an AU. For instance, where an AU has a list of ten favorites, HU voice profiles may be developed, maintained, and morphed over time for each of those favorites. Here, again, the profiles may be stored at different locations and by different devices including the AU device, a relay, a third party service provider, or even an HU device where the HU earmarks certain AUs as having the HU as a favorite or a contact.

In some cases it may be difficult technologically for a CA to correct AVR captions. Here, instead of a CA correcting captions, another option would simply be for a CA to mark errors in AVR text as wrong and move along. Here, the error could be indicated to an AU via the display on an AU's device. In addition, the error could be used to train an HU voice profile and/or captioning model as described above. As another alternative, where a CA marks a word wrong, a correction engine may generate and present a list of alternative words for the CA to choose from. Here, using an on screen tool, the CA may select a correct word option, causing the correction to be presented to an AU as well as causing the AVR to train to the corrected word.

In at least some cases it is contemplated that it may be useful to run periodic tests on CA generated text captions to track CA accuracy or reliability over time. For instance, in some cases CA reliability testing can be used to determine when a particular CA could use additional or specialized training. In other cases, CA reliability testing may be useful for determining when to cut a CA out of a call to be replaced by automatic speech recognition (ASR) generated text. In this regard, for instance, if a CA is less reliable than an ASR application for at least some threshold period of time, a system processor may automatically cut the CA out even if ASR quality remains below some threshold target quality level, so long as the ASR quality is persistently above the quality of CA generated text. As another instance, where CA quality is low, text from the CA may be fed to a second CA for either a first or second round of corrections prior to transmission to an AU device for display or, alternatively, a second relatively more skilled CA trained in handling difficult HU voice signals may be swapped into the transcription process in order to increase the quality level of the transcribed text. As still one other instance, CA reliability testing may be useful to a governing agency interested in tracking CA accuracy for some reason.

In at least some cases it has been recognized that in addition to assessing CA captioning quality, it will be useful to assess how accurately an automated speech recognition system can caption the same HU voice signal, regardless of whether or not the quality values are used to switch the method of captioning. For instance, in at least some cases line noise or other signal parameters may affect the quality of the HU voice signal received at a relay and therefore a low CA captioning quality may be at least in part attributed to line noise and other signal processing issues. In this case, an ASR quality value for ASR generated text corresponding to the HU voice signal may be used as an indication of other parameters that affect CA captioning quality and therefore in part as a reason or justification for a low CA quality value. For instance, where an ASR quality value is 75% out of 100% and a CA quality value is 87% out of 100%, the low ASR quality value may be used to show that, given the relatively higher CA quality value, the CA value is quite good despite being below a minimum target threshold. Line noise and other parameters may be measured in more direct ways via line sensors at a relay or elsewhere in the system and parameter values indicative of line noise and other characteristics may be stored along with CA quality values for consideration when assessing CA quality.

Several ways to test CA accuracy and generate accuracy statistics are contemplated by the present disclosure. One system for testing and tracking accuracy may include a system where actual or simulated HU-AU calls are recorded for subsequent testing purposes and where HU turns (e.g., voice signal periods) in each call are transcribed and corrected by a CA to generate a true and highly accurate (e.g., approximately 100% accurate) transcription of the HU turns that is referred to hereinafter as the "truth".

During testing, without a CA knowing, the recording is played for the CA who perceives the recording to be a typical HU-AU call. In many cases, a large number of recorded calls may be generated and stored for use by the testing system so that a CA never listens to the same test recording more than once. In some cases a system processor may track CAs and which test recordings each CA has been exposed to previously and may ensure that a CA only listens to any test recording once.

As a CA listens to a test recording, the CA transcribes the HU voice signal to text and, in at least some cases, makes corrections to the text. Because the CA generated text corresponds to a recorded voice signal and not a real time signal, the text is not forwarded to an AU device for display. The CA is unaware that the text is not forwarded to the AU device as this exercise is a test. The CA generated text is compared to the truth and a quality value is generated for the CA generated text (hereinafter a "CA quality value"). For instance, the CA quality value may be a percent accuracy representing the percent of HU voice signal words accurately transcribed to text. The CA quality value is then stored in a database for subsequent access.

In addition to generating a CA quality value that represents how accurately a CA transcribes voice to text, in at least some cases the system will be programmed to track and record transcription latency that can be used as a second type of quality factor referred to hereinafter as the "CA latency value". Here, the system may track instantaneous latency and use the instantaneous values to generate average and other statistical latency values. For instance, an average latency over an entire call may be calculated, an average latency over a most recent one minute period may be calculated, a maximum latency during a call and a minimum latency during a call may be calculated, a latency average excluding the most latent 20% and least latent 20% of a call may be calculated and stored, etc. In some cases where both a CA quality value and CA latency values are generated, the system may combine the quality and latency values according to some algorithm to generate an overall CA service value that reflects the combination of accuracy and latency.
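
The combining algorithm is left open above; one non-limiting possibility is a weighted score in which latency is normalized against a target, as sketched below. The weights and the 5-second latency target are purely illustrative assumptions.

# Illustrative sketch: fold a CA quality value (0-100) and an average
# latency into a single service value.  Weights and target are assumed.
def service_value(quality_pct, avg_latency_sec, target_latency_sec=5.0,
                  quality_weight=0.7, latency_weight=0.3):
    latency_score = 100.0 * min(1.0, target_latency_sec / max(avg_latency_sec, 0.01))
    return quality_weight * quality_pct + latency_weight * latency_score

# Example: 87% accuracy with an 8-second average lag yields roughly 79.7.
print(round(service_value(87.0, 8.0), 1))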

CA latency may also be calculated in other ways. For instance, in at least some cases a relay server may be programmed to count the number of words during a period that are received from an ASR service provider (see 1006 in FIG. 30) and to assume that the returned number of words represents the actual words per minute (WPM) spoken by an HU. Here, periods of HU silence may be removed from the period so that the word count more accurately reflects the WPM of the speaking HU. Then, the number of words generated by a CA for the same period may be counted and used along with the period duration minus silent periods to determine a CA WPM count. The server may then compare the speaker WPM to the CA WPM count to assess CA delay or latency.
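
This WPM comparison is straightforward arithmetic; the non-limiting sketch below assumes word counts and non-silent durations are already available for the same time window.

# Illustrative sketch: estimate CA lag by comparing the CA's effective WPM
# against the speaker's WPM inferred from ASR output over the same window.
def wpm(word_count, window_seconds, silent_seconds):
    active = max(window_seconds - silent_seconds, 1e-6)
    return 60.0 * word_count / active

def ca_lag_words(asr_words, ca_words, window_seconds, silent_seconds):
    """Words the CA is behind the speaker over this window (negative = caught up)."""
    speaker_wpm = wpm(asr_words, window_seconds, silent_seconds)
    ca_wpm = wpm(ca_words, window_seconds, silent_seconds)
    return (speaker_wpm - ca_wpm) * (window_seconds - silent_seconds) / 60.0

# Example: 60-second window with 15 silent seconds, 150 ASR words vs 120 CA words.
print(round(ca_lag_words(150, 120, 60, 15), 1))   # about 30 words behind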

In at least some cases the recorded call may also be provided to an ASR to generate automatic text. The ASR generated text may also be compared to the truth and an "ASR quality value" may be generated. The ASR quality value may be stored in a database for subsequent use or may be compared to the CA quality value to assess which quality value is higher or for some other purpose. Here, also, an ASR latency value or ASR latency values (e.g., max, min, average over a call, average over a most recent period, etc.) may be generated as well as an overall ASR service value. Again, the ASR and CA values may be used by a system processor to determine when the ASR generated text should be swapped in for the CA generated text and vice versa.

Referring now to FIG. 29, an exemplary system 1000 for testing and tracking CA and AVR quality and latency values using recorded HU-AU calls is illustrated. System 1000 includes relay components represented by the phantom box at 1001 and a cloud based ASR system 1006 (e.g., a server that is linked to via the internet or some other type of computing network). Two sources of pre-generated information are maintained at the relay including a set of recorded calls at 1002 and a set of verified true transcripts at 1010, one truth or true transcript for each recorded call in the set 1002. Again, the recorded calls may include actual HU-AU calls or may include mock calls that occur between two knowing parties that simulate an actual call.

During testing, a connection is linked from a system server that stores the calls 1002 to a captioning platform as shown at 1004 and one of the recorded calls, hereinafter referred to as a test recording, is transmitted to the captioning platform 1004. The captioning platform 1004 sends the received test recording to two targets including a CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the CA generates CA generated text which is forwarded on to a second comparison engine 1014. The verified truth text transcript at 1010 is provided to each of the first and second comparison engines 1012 and 1014. The first engine 1012 compares the ASR text to the truth and generates an ASR quality value and the second engine 1014 compares the CA generated text to the truth and generates a CA quality value, each of which is provided to a system database 1016 for storage until subsequently required.
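
In essence, comparison engines 1012 and 1014 compute a word-level accuracy against the truth transcript. The following non-limiting sketch uses a standard edit-distance style alignment from Python's difflib purely for brevity; a production comparison engine would likely use a more sophisticated alignment.

import difflib

# Illustrative sketch of a comparison engine (1012/1014): align hypothesis
# words against the verified "truth" transcript and report percent accuracy.
def quality_value(truth_text, hypothesis_text):
    truth = truth_text.lower().split()
    hyp = hypothesis_text.lower().split()
    matcher = difflib.SequenceMatcher(None, truth, hyp)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(truth), 1)

print(round(quality_value("we are meeting at the restaurant at nine",
                          "we are meeting at the rodent at nine"), 1))  # 87.5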

In addition, in some cases, some component within the system 1000 generates latency values for each of the ASR text and the CA generated text by comparing the times at which words are uttered in the HU voice signal to the times at which the text corresponding thereto is generated. The latency values are represented by clock symbols 1003 and 1005 in FIG. 29. The latency values are stored in the database 1016 along with the associated ASR and CA quality values generated by the comparison engines 1012 and 1014.

Another way to test CA quality contemplated by the present disclosure is to use real time HU-AU calls to generate quality and latency values. In these cases, a first CA may be assigned to an ongoing HU-AU call and may operate in a conventional fashion to generate transcribed text that corresponds to an HU voice signal, where the transcribed text is transmitted back to the AU device for display substantially simultaneously as the HU voice is broadcast to the AU. Here, the first CA may perform any process to convert the HU voice to text such as, for instance, revoicing the HU voice signal to a processor that runs voice to text software trained to the voice of the CA to generate text and then correcting the text on a display screen prior to sending the text to the AU device for display. In addition, the CA generated text is also provided to a second CA along with the HU voice signal and the second CA listens to the HU voice signal, views the text generated by the first CA and makes corrections to the first CA generated text. Having been corrected a second time, the text generated by the second CA is a substantially error free transcription of the HU voice signal referred to hereinafter as the "truth". The truth and the first CA generated text are provided to a comparison engine which then generates a "CA quality value" similar to the CA quality value described above with respect to FIG. 29, which is stored for subsequent access in a database.

In addition, as is the case in FIG. 29, in the case of transcribing an ongoing HU-AU call, the HU voice signal may also be provided to a cloud based ASR server or service to generate automated speech recognition text during an ongoing call that can be compared to the truth (e.g., the second CA generated text) to generate an ASR quality value. Here, while conventional ASRs are fast, there will again be some latency in text generation and the system will be able to generate an ASR latency value.

Referring now to FIG. 30, an exemplary system 1020 for testing and tracking CA and AVR quality and latency values using ongoing HU-AU calls is illustrated. Components in the FIG. 30 system 1020 that are similar to the components described above with respect to FIG. 29 are labeled with the same numbers and operate in a similar fashion unless indicated otherwise hereafter. In addition to an HU communication device 1040 and an AU communication device 1042 (e.g., a caption type telephone device), system 1020 includes relay components represented by the phantom box at 1021 and a cloud based ASR system 1006 akin to the cloud based system described above with respect to FIG. 29. Here there is no pre-generated and recorded call or pre-generated truth text, as testing is done using an ongoing dynamic call. Instead, a second CA at 1030 corrects text generated by a first CA at 1008 to create a truth (e.g., essentially 100% accurate text). The truth is compared to ASR generated text and the first CA generated text to create quality values to be stored in database 1016.

Referring still to FIG. 30, during testing, as in a conventional relay assisted captioning system, the AU device 1042 transmits an HU voice signal to the captioning platform at 1004. The captioning platform 1004 sends the received HU voice signal to two targets including a first CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the first CA generates CA generated text which is transmitted to at least three different targets. First, the first CA generated text, which may include text corrected by the first CA, is transmitted to the AU device 1042 for display to the AU during the call. Second, the first CA generated text is transmitted to the second comparison engine 1014. Third, the first CA generated text is transmitted to a second CA at 1030. The second CA at 1030 views the CA generated text on a display screen and also listens to the HU voice signal and makes corrections to the first CA generated text, where the second CA generated text operates as a truth text or truth. The truth is transmitted to the second comparison engine at 1014 to be compared to the first CA generated text so that a CA quality value can be generated. The CA quality value is stored in database 1016 along with one or more CA latency values.

Referring again to FIG. 30, the truth is also transmitted from the second call assistant at 1030 to the first comparison engine at 1012 to be compared to the ASR generated text so that an ASR quality value is generated, which is also stored along with at least one ASR latency value in the database 1016.

Referring to FIG. 31, another embodiment of a testing relay system is shown at 1050 which is similar to the system 1020 of FIG. 30, albeit where the ASR service 1006 provides an initial text transcription to the second CA at 1052 instead of the second CA receiving the initial text from the first call assistant. Here, the second CA generates the truth text which is again provided to the two comparison engines at 1012 and 1014 so that ASR and CA quality factors can be generated to be stored in database 1016.

The ASR text generation and quality testing processes are described above as occurring essentially in real time as a first CA generates text for a recorded or ongoing call. Here, real time quality and latency testing may be important where a dynamic triage transcription process is occurring where, for instance, ASR generated text may be swapped in for a cut out CA when ASR generated text achieves some quality threshold or a CA may be swapped in for ASR generated text if the ASR quality value drops below some threshold level. In other cases, however, quality testing may not need to be real time and instead may be able to be done off line for some purposes. For instance, where quality testing is only used to provide metrics to a government agency, the testing may be done off line.

In this regard, referring again to FIG. 29, in at least some cases where testing cannot be done on the fly as a CA at 1008 generates text, the CA text and the recorded HU voice signal associated therewith may be stored in database 1016 for subsequent access for generating the ASR text at 1006 as well as for comparing the CA generated text and the ASR generated text to the verified truth text from 1010. Similarly, referring again to FIG. 30, where real time quality and latency values are not required, at least the HU portion of a call may be stored in database 1016 for subsequent off line processing by ASR service 1006 and the second CA at 1030 and then for comparisons to the truth at engines 1012 and 1014.

One advantage of generating quality and latency values in real time using real HU-AU calls is that there is no need to store calls for subsequent processing. Currently there are regulations in at least some jurisdictions that prohibit storing calls for privacy reasons and therefore off line quality testing cannot be done in these cases.

In at least some embodiments it is contemplated that quality and latency testing may only be performed sporadically and generally randomly so that generated values are a rough average representation of the overall captioning service. In other cases, while quality and latency testing may be periodic in general, it is contemplated that telltale signs of poor quality during transcription may be used to trigger additional quality and latency testing. For instance, in at least some cases where an AU is receiving ASR generated text and the AU selects an option to link to a CA for correction, the AU request may be used as a trigger to start the quality testing process on text received from that point on (e.g., quality testing will commence and continue for HU voice received as time progresses forward). Similarly, when an AU requests full CA captioning (e.g., revoicing and text correction), quality testing may be performed from that point forward on the CA generated text.

In other cases, it is contemplated that an HU-AU call may be stored during the duration of the call and that, at least initially, no quality testing may occur. Then, if an AU requests CA assistance, in addition to patching a CA into the call to generate higher quality transcription, the system may automatically patch in a second CA that generates truth text as in FIG. 30 for the remainder of the call. In addition or instead, when the AU requests CA assistance, the system may, in addition to patching a CA in to generate better quality text, also cause the recorded HU voice prior to the request to be used by a second CA to generate truth text for comparison to the ASR generated text so that an ASR quality value for the text that caused the AU to request assistance can be generated. Here, the pre-CA assistance ASR quality value may be generated for the entire duration of the call prior to the request or just for a most recent sub-period (e.g., for the prior minute or 30 seconds). Here, in at least some cases, it is contemplated that the system may automatically erase any recorded portion of an HU-AU call immediately after any quality values associated therewith have been calculated. In cases where quality values are only calculated for a most recent period of HU voice signal, recordings prior thereto may be erased on a rolling basis.

As another instance, in at least some cases it is contemplated that sensors at a relay may sense line noise or other signal parameters and, whenever the line noise or other parameters meet some threshold level, the system may automatically start quality testing which may persist until the parameters no longer meet the threshold level. Here, there may be hysteresis built into the system so that once a threshold is met, at least some duration of HU voice signal below the threshold is required to halt the testing activities. The parameter value or condition or circumstance that triggered the quality testing would, in this case, be stored along with the quality value and latency information to add context to why the system started quality testing in the specific instance.

As one other example, in a case where an AU signals dissatisfaction with a captioning service at the end of a call, quality testing may be performed on at least a portion of the call. To this end, in at least some cases as an HU-AU call progresses, the call may be recorded regardless of whether or not ASR or CA generated text is presented to an AU. Then, at the end of a call, a query may be presented to the AU requesting that the AU rate the AU's satisfaction with the call and captioning on some scale (e.g., a 1 through 10 quality scale with 10 being high). Here, if a satisfaction rating were low (e.g., less than 7) for some reason, the system may automatically use the recorded HU voice or at least a portion thereof to generate a CA quality value in one of the ways described above. For instance, the system may provide the text generated by a first CA or by the ASR and the recorded HU voice signal to a second CA for generating truth, and a quality value may be generated using the truth text for storage in the database.

In still other cases where an AU expresses a low satisfaction rating for a captioning service, prior to using a recorded HU voice signal to generate a quality value, the system server may request authorization to use the signal to generate a captioning quality value. For instance, after an AU indicates a 7 or lower on a satisfaction scale, the system may query the AU for authorization to check captioning quality by providing a query on the AU's device display along with "Yes" and "No" options. Here, if the yes option is selected, the system would generate the captioning quality value for the call and memorialize that value in the system database 1016.

As another instance, because it is the HU's voice signal that is recorded (e.g., in some cases the AU voice signal may not be recorded) and used to generate the captioning quality value, authorization to use the recording to generate the quality value may be sought from an HU if the HU is using a device that can receive and issue an authorization request at the end of a call. For instance, in the case of a call where an HU uses a standard telephone, if an AU indicates a low satisfaction rating at the end of a call, the system may transmit an audio recording to the HU requesting authorization to use the HU voice signal to generate the quality value along with instructions to select "1" for yes and "2" for no. In other cases where an HU's device is a smart phone or other computing type device, the request may include text transmitted to the HU device and selectable "Yes" and "No" buttons for authorizing or not.

While an HU-AU call recording may be at least temporarily stored at a relay, in other cases it is contemplated that call recordings may be stored at an AU device or even at an HU device until needed to generate quality values. In this way, an HU or AU may exercise more control or at least perceive to exercise more control over call content. Here, for instance, while a call may be recorded, the recording device may not release recordings unless authorization to do so is received from a device operator (e.g., an HU or an AU). Thus, for instance, if the HU voice signal for a call is stored on an HU device during the call and, at the end of the call, an AU expresses low satisfaction with the captioning service in response to a satisfaction query, the system may query the HU to authorize use of the HU voice to generate captioning quality values. In this case, if the HU authorizes use of the HU voice signal, the recorded HU voice signal would be transmitted to the relay to be used to generate captioning quality values as described above. Thus, the HU or AU device may serve as a sort of software vault for HU voice signal recordings that are only released to the relay after proper authorization is received from the HU or the AU, depending on system requirements.

As generally known in the industry, voice to text software accuracy is higher for software that is trained to the voice of a speaking person. Also known is that software can train to specific voices over short durations. Nevertheless, in most cases it is advantageous if software starts with a voice model trained to a particular voice so that captioning can be accurate immediately upon commencing transcription. Thus, for instance, in FIG. 30, when a specific HU calls an AU to converse, it would be advantageous if the ASR service at 1006 had access to a voice model for the specific HU. One way to do this would be to have the ASR service 1006 store voice models for at least HUs that routinely call an AU (e.g., a top ten HU list for each AU) and, when an HU voice signal is received at the ASR service, the service would identify the HU voice signal either using recognition software that can distinguish one voice from others or via some type of an identifier like the phone number of the HU device used to call the AU. Once the HU voice is identified, the ASR service accesses an HU voice model associated with the HU voice and uses that model to perform automated captioning.

One problem with systems that require an ASR service to store HU voice models is that HUs may prefer to not have their voice models stored by third party ASR service providers or at least to not have the models stored and associated with specific HUs. Another problem may be that regulatory agencies may not allow a third party ASR service provider to maintain HU voice models or at least models that are associated with specific HUs. One solution is that no information useable to associate an HU with a voice model may be stored by an ASR service provider. Here, instead of using an HU identifier like a phone number or other network address associated with an HU's device to identify an HU, an ASR server may be programmed to identify an HU's voice signal from analysis of the voice signal itself in an anonymous way.

Another solution may be for an AU device to store HU voice models for frequent callers where each model is associated with an HU identifier like a phone number or network address associated with a specific HU device. Here, when a call is received at an AU device, the AU device processor may use the number or address associated with the HU device to identify which voice model to associate with the HU device. Then, the AU device may forward the HU voice model to the ASR service provider 1006 to be used temporarily during the call to generate ASR text. Similarly, instead of forwarding an HU voice model to the ASR service provider, the AU device may simply forward an intermediate identification number or other identifier associated with the HU device to the ASR provider and the provider may associate the number with a specific HU voice model stored by the provider to access an appropriate HU voice model to use for text transcription. Here, for instance, where an AU supports ten different HU voice models for the 10 most recent HU callers, the models may be associated with numbers 1 through 10 and the AU device may simply forward one of the intermediate identifiers (e.g., "7") to the ASR provider 1006 to indicate which one of the ten voice models maintained by the ASR provider for the AU to use with the HU voice transmitted.
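
By way of a non-limiting illustration, the intermediate-identifier scheme amounts to a simple lookup on each side of the link: the AU device maps an HU phone number to a small slot number, and the ASR provider maps that slot back to a stored voice model, so the provider never sees the HU's number. The phone numbers and model names below are placeholders only.

# Illustrative sketch: anonymous voice-model selection via an intermediate
# identifier.  The AU device knows number -> slot; the ASR provider knows
# only slot -> model.  All entries below are placeholders.
AU_DEVICE_SLOTS = {"555-0101": 1, "555-0102": 2}          # kept on the AU device
ASR_PROVIDER_MODELS = {1: "voice_model_slot_1.bin",       # kept by the ASR provider
                       2: "voice_model_slot_2.bin"}

def slot_for_caller(caller_number):
    return AU_DEVICE_SLOTS.get(caller_number)              # sent to the provider

def model_for_slot(slot):
    return ASR_PROVIDER_MODELS.get(slot)                   # provider-side lookup

print(model_for_slot(slot_for_caller("555-0102")))          # voice_model_slot_2.bin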

In still other cases an HU may maintain one or more HU voice models that can be forwarded on to an ASR provider, either through the relay or directly, to generate text.

In at least some cases other more complex quality analysis and statistics are contemplated that may be useful in determining better ways to train CAs as well as in assessing CA quality values. For instance, it has been recognized that voice to text errors can generally be split into two different categories referred to herein as "visible" and "invisible" errors. Visible errors are errors that result in text that, upon reading, is clearly erroneous, while invisible errors are errors that result in text that, despite the error that occurred, makes sense in context. For instance, where an HU voices the phrase "We are meeting at Joe's restaurant for pizza at 9 PM", in a text transcription "We are meeting at Joe's rodent for pizza at 9 PM", the word "rodent" is a "visible" error in the sense that an AU reading the phrase would quickly understand that the word "rodent" makes no sense in context. On the other hand, if the HU's phrase were transcribed as "We are meeting at Joe's room for pizza at 9 PM", the erroneous word "room" is not contextually wrong and therefore cannot be easily discerned as an error. Because the word "restaurant" has been replaced by "room", an AU could easily get a wrong impression, and for that reason invisible errors are generally considered worse than visible errors.

In at least some cases it is contemplated that some mechanism for distinguishing visible and invisible text transcription errors may be included in a relay quality testing system. For instance, where 10 errors are made during some sub-period of an HU-AU call, three of the errors may be identified as invisible while seven are visible. Here, because invisible errors typically have a worse effect on communication effectiveness, statistics that capture the relative number of invisible errors to all errors should be useful in assessing CA or ASR quality.

In at least some systems it is contemplated that a relay server may be programmed to automatically identify at least visible errors so that statistics related thereto can be captured. For instance, the server may be able to contextually examine text and identify words or phrases that simply make no sense and may identify each of those nonsensical errors as a visible error. Here, because invisible errors make contextual sense, there is no easy algorithm by which a processor or server can identify invisible errors. For this reason, in at least some cases a correcting CA (see 1053 in FIG. 31) may be required to identify invisible errors or, in the alternative, the system may be programmed to automatically use CA corrections to identify invisible errors. In this regard, any time a CA changes a word in a text phrase that initially made sense within the phrase to another word that contextually makes sense in the phrase, the system may recognize that type of correction as having been associated with an invisible error.

In at least some cases it is contemplated that the decision to switch captioning methods may be tied at least in part to the types of errors that are identified during a call. For instance, assume that a CA is currently generating text corresponding to an HU voice signal and that an ASR is currently training to the HU voice signal but is not currently at a high enough quality threshold to cut out the CA transcription process. Here, there may be one threshold for the CA quality value generally and another for the CA invisible error rate where, if either of the two thresholds is met, the system automatically cuts the CA out. For example, the threshold CA quality value may require 95% accuracy and the CA invisible error rate may be 20% coupled with a 90% overall accuracy requirement. Thus, here, if the invisible error rate amounts to 20% or less of all errors and the overall CA text accuracy is above 90% (e.g., the invisible error rate is less than 2% of all words uttered by the HU), the CA may be cut out of the call and ASR text relied upon for captioning. Other error types are contemplated, and a system for distinguishing each of several error types from one another for statistical reporting and for driving the captioning triage process is contemplated.
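
The dual-threshold example in the preceding paragraph can be written as a short predicate, as in the non-limiting sketch below; the 95%, 90%, and 20% figures come directly from the example above, while the function itself is only illustrative.

# Illustrative sketch of the cut-over test described above: cut the CA out
# either when overall accuracy reaches 95%, or when invisible errors are no
# more than 20% of all errors and overall accuracy is at least 90%.
def can_cut_ca(overall_accuracy_pct, invisible_errors, total_errors):
    if overall_accuracy_pct >= 95.0:
        return True
    invisible_share = invisible_errors / total_errors if total_errors else 0.0
    return invisible_share <= 0.20 and overall_accuracy_pct >= 90.0

print(can_cut_ca(92.0, invisible_errors=1, total_errors=10))   # True
print(can_cut_ca(92.0, invisible_errors=4, total_errors=10))   # False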

In at least some cases, when to transition from CA generated text to ASR generated text may be a function of not just a straight up comparison of ASR and CA quality values and instead may be related to both quality and the relative latency associated with different transcription methods. In addition, when to transition in some cases may be related to a combination of quality values, error types and relative latency as well as to user preferences.

Other triage processes for identifying which HU voice to text method should be used are contemplated. For instance, in at least some embodiments when an ASR service or ASR software at a relay is being used to generate and transmit text to an AU device for display, if an ASR quality value drops below some threshold level, a CA may be patched into the call in an attempt to increase quality of the transcribed text. Here, the CA may either be a full revoicing and correcting CA, just a correcting CA that starts with the ASR generated text and makes corrections, or a first CA that revoices and a second CA that makes corrections. In a case where a correcting CA is brought into a call, in at least some cases the ASR generated text may be provided to the AU device for display at the same time that the ASR generated text is sent to the CA for correction. In that case, corrected text may be transmitted to the AU device for in line correction once generated by the CA. In addition, the system may track quality of the CA corrected text and store a CA quality value in a system database.

In other cases when a CA is brought into a call, text may not be transmitted to the AU device until the CA has corrected that text, and then the corrected text may be transmitted.

In some cases, when a CA is linked to a call because the ASR generated text was not of a sufficiently high quality, the CA may simply start correcting text related to HU voice signal received after the CA is linked to the call. In other cases the CA may be presented with text associated with HU voice signal that was transcribed prior to the CA being linked to the call for the CA to make corrections to that text, and then the CA may continue to make corrections to the text as subsequent HU voice signal is received.

Thus, as described above, in at least some embodiments an HU's communication device will include a display screen and a processor that drives the display screen to present a quality indication of the captions being presented to an AU. Here, the quality characteristic may include some accuracy percentage, the actual text being presented to the AU, or some other suitable indication of caption accuracy or an accuracy estimation. In addition, the HU device may present one or more options for upgrading the captioning quality such as, for instance, requesting CA correction of automated text captioning, requesting CA transcription and correction, etc.

Additional Specification

In at least some embodiments described above, various HU voice delay concepts have been described where an HU's voice signal broadcast is delayed in order to bring the voice signal broadcast more temporally in line with associated captioned text. Thus, for instance, in a system that requires at least three seconds (and at times more time) to transcribe an HU's voice signal to text for presentation, a system processor may be programmed to introduce a three second delay in HU voice broadcast to an AU to bring the HU voice signal broadcast more into simultaneous alignment with associated text generated by the system. As another instance, in a system where an ASR requires at least two seconds to transcribe an HU's voice signal to text for presentation to a correcting CA, the system processor may be programmed to introduce a two second delay in the HU voice that is broadcast to an AU to bring the HU voice signal broadcast more into temporal alignment with the ASR generated text.

In the above examples, the three and two second delays are simply based on the average minimum voice-to-text delays that occur with a specific voice to text system and therefore, at most times, will only imprecisely align an HU voice signal with corresponding text. For instance, in a case where HU voice broadcast is delayed three seconds, if text transcription is delayed ten seconds, the three second delay would be insufficient to align the broadcast voice signal and text presentation. As another instance, where the HU voice is delayed three seconds, if a text transcription is generated in one second, the three second delay would cause the HU voice to be broadcast two seconds after presentation of the associated text. In other words, in this example, the three second HU voice delay would be too much delay at times and too little at other times and misalignment could cause assisted user confusion.

In at least some embodiments it is contemplated that a transcription system may assign time stamps to various utterances in an HU's voice signal and those time stamps may also be assigned to text that is then generated from the utterances so that the HU voice and text can be precisely synchronized per user preferences (e.g., precisely aligned in time or, if preferred by an AU, with an HU's voice preceding or delayed with respect to text by the same persistent period) when broadcast and presented to the AU, respectively. While alignment per an AU's preferences may cause an HU voice to be broadcast prior to or after presentation of associated text, hereinafter, unless indicated otherwise, it will be assumed that an AU's preference is that the HU voice and related text be broadcast and presented at substantially the same time. It should be recognized that in any embodiment described hereafter where the description refers to aligned or simultaneous voice and text, the same teachings will be applicable to cases where voice and text are purposefully misaligned by a persistent period (e.g., always misaligned by 3 seconds per user preference).

Various systems are contemplated for assigning time stamps to HU voice signals and associated text words and/or phrases. In a first relatively simple case, an AU device that receives an HU voice signal may assign periodic time stamps to sequentially received voice signal segments and store the HU voice signal segments along with associated time stamps. The AU device may also transmit at least an initial time stamp (e.g., corresponding to the beginning of the HU voice signal or the beginning of a first HU voice signal segment during a call) along with the HU voice signal to a relay when captioning is to commence.

In at least some embodiments the relay stores the initial time stamp in association with the beginning instant of the received HU voice signal and continues to store the HU voice signal as it is received. In addition, the relay operates its own timer to generate time stamps for on-going segments of the HU voice signal as the voice signal is received and the relay generated time stamps are stored along with associated HU voice signal segments (e.g., one time stamp for each segment that corresponds to the beginning of the segment). In a case where a relay operates an ASR engine or taps into a fourth party ASR service (e.g., Google Voice, IBM's Watson, etc.) where a CA checks and corrects ASR generated text, the ASR engine generates automated text for HU voice segments in real time as the HU voice signal is received.

A CA computer at the relay simultaneously broadcasts the HU voice segments and presents the ASR generated text to a CA at the relay for correction. Here, the ASR engine speed will fluctuate somewhat based on several factors that are known in the speech recognition art so that it can be assumed that the ASR engine will translate a typical HU voice signal segment to text within anywhere between a fraction of a second (e.g., one tenth of a second) to 10 seconds. Thus, where the CA computer is configured to simultaneously broadcast HU voice and present ASR generated text for CA consideration, the relay is programmed to delay the HU voice signal broadcast dynamically for a period within the range of a fraction of a second up to the maximum number of seconds required for the ASR engine to transcribe a voice segment to text. Again, here, a CA may have control over the timing between text presentation and HU voice broadcast and may prefer one or the other of the text and voice to precede the other (e.g., HU voice to precede corresponding text by two seconds or vice versa). In these cases, the preferred delay between voice and text can be persistent and unchanging which results in less CA confusion.
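
A minimal sketch of the dynamic delay described above appears below, assuming a simple polling loop and illustrative data structures; Segment, present_to_ca and the other names are invented for the example, and a production relay would use event driven audio scheduling rather than polling.

```python
# Sketch (not the disclosed implementation) of dynamically delaying HU voice
# broadcast at a CA station until the matching ASR text segment is available,
# with an optional persistent CA-preferred offset.
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    stamp: float                 # time stamp assigned when the voice segment was received
    audio: str                   # placeholder for the audio payload
    text: Optional[str] = None   # ASR text, filled in when transcription completes


def present_to_ca(segments, ca_offset_sec: float = 0.0, max_wait_sec: float = 10.0) -> None:
    """Broadcast each voice segment only once its ASR text exists (or a
    maximum wait expires), keeping voice and text in persistent alignment."""
    for seg in segments:
        waited = 0.0
        while seg.text is None and waited < max_wait_sec:
            time.sleep(0.1)            # poll for the ASR result
            waited += 0.1
        if ca_offset_sec > 0:
            time.sleep(ca_offset_sec)  # CA prefers text to precede voice by a fixed period
        print(f"[{seg.stamp:6.2f}] broadcast: {seg.audio!r}  text: {seg.text!r}")


# Example with pre-transcribed segments (so the sketch runs instantly):
demo = [Segment(0.0, "<audio 0-2s>", "Hi, how are"),
        Segment(2.0, "<audio 2-4s>", "you doing today?")]
present_to_ca(demo)
```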

After a CA corrects text errors in the ASR engine generated text, in at least some cases the relay transmits the time stamped text back to the AU caption device for display to the AU. Upon receiving the time stamped text from the relay, the AU device accesses the time stamped HU voice signal stored thereat and associates the text and HU voice signal segments based on similar (e.g., closest in time) or identical time stamps and stores the associated text and HU voice signal until presented and broadcast to the AU. The AU device then simultaneously (or delayed per user preference) broadcasts the HU voice signal segments and presents the corresponding text to the AU via the AU caption device in at least some embodiments.
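
One way the closest-in-time association performed by the AU device might be implemented is sketched below; the tuple structures and the correlate function name are assumptions for illustration only.

```python
# Sketch of pairing relay-supplied time stamped caption text with HU voice
# segments stored locally at the AU device, using the closest time stamp.
from typing import List, Tuple


def correlate(voice_segments: List[Tuple[float, str]],
              text_segments: List[Tuple[float, str]]) -> List[Tuple[float, str, str]]:
    """Pair each text segment with the stored voice segment whose time stamp
    is closest in time; returns (stamp, audio, text) triples for playback."""
    paired = []
    for t_stamp, text in text_segments:
        stamp, audio = min(voice_segments, key=lambda seg: abs(seg[0] - t_stamp))
        paired.append((stamp, audio, text))
    return paired


voice = [(0.0, "<audio 0-3s>"), (3.0, "<audio 3-6s>"), (6.0, "<audio 6-9s>")]
text = [(0.1, "Hello there,"), (3.2, "it's good to"), (5.9, "hear from you.")]
for stamp, audio, caption in correlate(voice, text):
    print(stamp, audio, caption)
```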

A flow chart that is consistent with this simple first case of time stamping text segments is shown in FIG. 32 and will be described next. Referring also to FIG. 33, a system similar to the system described above with respect to FIG. 1 is illustrated where similar elements are labelled with the same numbers used in FIG. 1 and, unless indicated otherwise, operate in a similar fashion. The primary differences between the FIG. 1 system and the system described in FIG. 33 are that each of the AU caption device 12 and the relay 16 includes a memory device that stores, among other things, time stamped voice message segments corresponding to a received HU voice signal and that time stamps are transmitted between AU device 12 and relay server 30 (see 1034 and 1036).

Referring to FIGS. 32 and 33, during a call between an HU using an HU device 14 and an AU using AU device 12, at some point, captioning is required by the AU (e.g., either immediately when the call commences or upon selection of a caption option by the AU) at which point AU device 12 performs several functions. First, after captioning is to commence, at block 1102, the HU voice signal is received by the AU device 12. At block 1104, AU device 12 commences assignment and continues to assign periodic time stamps to the HU voice signal segments received at the AU device. The time stamps include an initial time stamp t0 corresponding to the instant in time when captioning is to commence or some specific instant in time thereafter as well as following time stamps. In addition, at block 1104, AU device 12 commences storing the received HU voice signal along with the assigned time stamps that divide up the HU voice signal into segments in AU device memory 1030.

Referring still to FIGS. 32 and 33, at block 1106, AU device 12 transmits the HU voice signal segments to relay 16 along with the initial time stamp t0 corresponding to the instant captioning was initiated where the initial time stamp is associated with the start of the first HU voice segment transmitted to the relay (see 1034 in FIG. 33). At block 1108, relay 16 stores the initial time stamp t0 along with the first HU voice signal segment in memory 1032, runs its own timer to assign subsequent time stamps to the HU voice signal received and stores the HU voice signal segments and relay generated time stamps in memory 1032. Here, because both the AU device and the relay assign the initial time stamp t0 to the same point within the HU voice signal and each assigns other stamps based on the initial time stamp, all of the AU device and relay time stamps should be aligned assuming that each assigns time stamps at the same periodic intervals (e.g., every second).

In other cases, each of the AU device and relay may assign second and subsequent time stamps having the form (t0+Δt) where Δt is a period of time relative to the initial time stamp t0. Thus, for instance, a second time stamp may be (t0+1 sec), a third time stamp may be (t0+4 sec), etc. In this case, the AU device and relay may assign time stamps that have different periods where the system simply aligns stamped text and voice when required based on the closest stamps in time.

Continuing, at block 1110, relay 16 runs an ASR engine to generate ASR engine text for each of the stored HU voice signal segments and stores the ASR engine text with the corresponding time stamped HU voice signal segments. At block 1112, relay 16 presents the ASR engine text to a CA for consideration and correction. Here, the ASR engine text is presented via a CA computer display screen 32 while the HU voice segments are simultaneously (e.g., as text is scrolled onto display 32) broadcast to the CA via headset 54. The CA uses display 32 and/or other interface devices to make corrections (see block 1116) to the ASR engine text. Corrections to the text are stored in memory 1032 and the resulting text is transmitted at block 1118 to AU device 12 along with a separate time stamp for each of the text segments (see 1036 in FIG. 33).

Referring yet again to FIGS. 32 and 33, upon receiving the time stamped text, AU device 12 correlates the time stamped text with the HU voice signal segments and associated time stamps in memory 1130 and stores the text with the associated voice segments and related time stamps at block 1120. At block 1122, in some embodiments, AU device 12 simultaneously broadcasts and presents the correlated HU voice signal segments and text segments to the AU via an AU device speaker and the AU device display screen, respectively.

Referring still to FIG. 32, it should be appreciated that the time stamps applied to HU voice signal segments and corresponding text segments enable the system to align voice and text when presented to each of a CA and an AU. In other embodiments it is contemplated that the system may only use time stamps to align voice and text for one or the other of a CA and an AU. Thus, for instance, in FIG. 32, the simultaneous broadcast step at 1112 may be replaced by voice broadcast and text presentation immediately when available and synchronous presentation and broadcast may only be available to the AU at step 1122. In a different system synchronous voice and text may be provided to the CA at step 1112 while HU voice signal and caption text are independently presented to the AU immediately upon reception at steps 1102 and 1122, respectively.

In the FIG. 32 process, the AU device only transmits an initial HU voice signal time stamp to the relay corresponding to the instant when captioning commences. In other cases it is contemplated that AU device 12 may transmit more than one time stamp corresponding to specific points in time to relay 16 that can be used to correct any voice and text segment misalignment that may occur during system processes. Thus, for instance, instead of sending just the initial time stamp, AU device 12 may transmit time stamps along with specific HU voice segments every 5 seconds or every 10 seconds or every 30 seconds, etc., while a call persists, and the relay may simply store each newly received time stamp along with an instant in the stream of HU voice signal received.

In still other cases AU device 12 may transmit enough AU device generated time stamps to relay 16 that the relay does not have to run its own timer to independently generate time stamps for voice and text segments. Here, AU device 12 would still store the time stamped HU voice signal segments as they are received and stamped and would correlate time stamped text received back from the relay 16 in the same fashion so that HU voice segments and associated text can be simultaneously presented to the AU.

A sub-process 1138 that may be substituted for a portion of the process described above with respect to FIG. 32 is shown in FIG. 34, albeit where all AU device time stamps are transmitted to and used by a relay so that the relay does not have to independently generate time stamps for HU voice and text segments. In the modified process, referring also and again to FIG. 32, after AU device 12 assigns periodic time stamps to HU voice signal segments at block 1104, control passes to block 1140 in FIG. 34 where AU device 12 transmits the time stamped HU voice signal segments to relay 16. At block 1142, relay 16 stores the time stamped HU voice signal segments after which control passes back to block 1110 in FIG. 32 where the relay employs an ASR engine to convert the HU voice signal segments to text segments that are stored with the corresponding voice segments and time stamps. The process described above with respect to FIG. 32 continues as described above so that the CA and/or the AU are presented with simultaneous HU voice and text segments.

In other cases it is contemplated that an AU device 12 may not assign any time stamps to the HU voice signal and, instead, the relay or a fourth party ASR service provider may assign all time stamps to voice and text signals to generate the correlated voice and text segments. In this case, after text segments have been generated for each HU voice segment, the relay may transmit both the HU voice signal and the corresponding text back to AU device 12 for presentation.

A process 1146 that is similar to the FIG. 32 process described above is shown in FIG. 35, albeit where the relay generates and assigns all time stamps to the HU voice signals and transmits the correlated time stamps, voice signals and text to the AU device for simultaneous presentation. In the modified process 1146, process steps 1150 through 1154 in FIG. 35 replace process steps 1102 through 1108 in FIG. 32 and process steps 1158 through 1162 in FIG. 35 replace process steps 1118 through 1122 in FIG. 32 while similarly numbered steps 1110 through 1116 are substantially identical between the two processes.

Process 1146 starts at block 1150 in FIG. 35 where AU device 12 receives an HU voice signal from an HU device where the HU voice signal is to be captioned. Without assigning any time stamps to the HU voice signal, AU device 12 links to a relay 16 and transmits the HU voice signal to relay 16 at block 1152. At block 1154, relay 16 uses a timer or clock to generate time stamps for HU voice signal segments after which control passes to block 1110 where relay 16 uses an ASR engine to convert the HU voice signal to text which is stored along with the corresponding HU voice signal segments and related time stamps. At block 1112, relay 16 simultaneously presents ASR text and broadcasts HU voice segments to a CA for correction and the CA views the text and makes corrections at block 1116. After block 1116, relay 16 transmits the time stamped text and HU voice segments to AU device 12 and that information is stored by the AU device as indicated at block 1160. At block 1162, AU device 12 simultaneously broadcasts and presents corresponding HU voice and text segments via the AU device speaker and display.

In cases where HU voice signal broadcast is delayed so that the broadcast is aligned with presentation of corresponding transcribed text, delay insertion points will be important in at least some cases or at some times. For instance, an HU may speak for 20 consecutive seconds where the system assigns a time stamp every 2 seconds. In this case, one solution for aligning voice with text would be to wait until the entire 20 second spoken message is transcribed and then broadcast the entire 20 second voice message and present the transcribed text simultaneously. This, however, is a poor solution as it would slow down HU-AU communication appreciably.

Another solution would be to divide up the 20 second voice message into 5 second periods with silent delays therebetween so that the transcription process can routinely catch up. For instance, here, during a first five second period plus a short transcription catch up period (e.g., 2 seconds), the first five seconds of the 20 second HU voice message is transcribed. At the end of the first 7 seconds of HU voice signal, the first five seconds of HU voice signal is broadcast and the corresponding text presented to the AU while the next 5 seconds of HU voice signal is transcribed. Transcription of the second 5 seconds of HU voice signal may take another 7 seconds which would mean that a 2 second delay or silent period would be inserted after the first five seconds of HU voice signal is broadcast to the AU. This process of inserting periodic delays into HU voice broadcast and text presentation while transcription catches up continues. Here, while it is possible that the delays at the five second times would be at ideal times between consecutive natural phrases, more often than not, the 5 second point delays would imperfectly divide natural language phrases making it more, not less, difficult to understand the overall HU voice message.

A better solution is to insert delays between natural language phrases when possible. For instance, in the case of the 20 second HU voice signal example above, a first delay may be inserted after a first 3 second natural language phrase, a second delay may be inserted after a second 4 second natural language phrase, a third delay may be inserted after a third 5 second natural language phrase, a fourth delay may be inserted after a fourth 2 second natural language phrase and a fifth delay may be inserted after a fifth 2 second natural language phrase, so that none of the natural language phrases during the voice message are broken up by intervening delays.

Software for identifying natural language phrases or natural breaks in an HU's voice signal may use actual delays between consecutive spoken phrases as one proxy for where to insert a transcription catch up delay. In some cases software may be able to perform word, sentence and/or topic segmentation in order to identify natural language phrases. Other software techniques for dividing voice signals into natural language phrases are contemplated and should be used as appropriate.
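
By way of illustration only, a pause based proxy for natural language phrase boundaries might look like the following sketch, where the word timing tuples, the min_pause_sec value and the function name are all assumptions for the example.

```python
# Illustrative sketch of using inter-word pauses as a proxy for natural
# language phrase boundaries, so that transcription catch-up delays can be
# inserted between phrases rather than inside them.
from typing import List, Tuple


def phrase_boundaries(words: List[Tuple[str, float, float]],
                      min_pause_sec: float = 0.35) -> List[List[str]]:
    """words: (word, start_sec, end_sec) tuples in spoken order.
    A new phrase is started whenever the silent gap before a word exceeds
    min_pause_sec; returns a list of phrases (each a list of words)."""
    phrases, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and (start - prev_end) > min_pause_sec and current:
            phrases.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        phrases.append(current)
    return phrases


sample = [("I", 0.0, 0.2), ("will", 0.25, 0.4), ("call", 0.45, 0.7),
          ("tomorrow", 0.75, 1.2), ("after", 1.9, 2.1), ("lunch", 2.15, 2.5)]
print(phrase_boundaries(sample))  # [['I', 'will', 'call', 'tomorrow'], ['after', 'lunch']]
```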

Thus, while some systems may assign perfectly periodic time stamps to HU voice signals to divide the signals into segments, in other cases time stamps will be assigned at irregular time intervals that make more sense given the phrases that an HU speaks, how an HU speaks, etc.

Where time stamps are assigned to HU voice and text segments, voice segments can be more accurately selected for replay via selection of associated text. For instance, see FIG. 36 that shows a CA display screen 50 with transcribed text represented at 1200. Here, as text is generated by a relay ASR engine and presented to a CA, consistent with at least some of the systems described above, the CA may select a word or phrase in presented text via touch (represented by hand icon 1202) to replay the HU voice signal associated therewith. When a word is selected in the presented text several things will happen in at least some contemplated embodiments. First, a current voice broadcast to the CA is halted. Second, the selected word is highlighted (see 1204) or otherwise visually distinguished. Third, when the word is highlighted, the CA computer accesses the HU voice segment associated with the highlighted word and re-broadcasts the voice segment for the CA to re-listen to the selected word. Where time stamps are assigned with short intervening periods, the time stamps should enable relatively precise replay of selected words from the text. In at least some cases, the highlight will remain and the CA may change the highlighted word or phrase via standard text editing tools.
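
A sketch of the time stamp lookup that could support this kind of replay appears below; the word-to-stamp list, the segment dictionary and the function name are illustrative assumptions rather than the disclosed interface.

```python
# Sketch of replaying the HU voice segment associated with a text word that a
# CA selects on screen, by finding the stored segment whose start stamp most
# recently precedes the selected word's stamp.
from typing import Dict, List, Tuple


def replay_selected_word(selected_index: int,
                         word_stamps: List[Tuple[str, float]],
                         voice_segments: Dict[float, str]) -> str:
    """word_stamps: (word, stamp) pairs in presentation order.
    voice_segments: segment start stamp -> stored audio clip for that segment."""
    _, stamp = word_stamps[selected_index]
    starts = [s for s in voice_segments if s <= stamp]
    start = max(starts) if starts else min(voice_segments)
    return voice_segments[start]


words = [("bring", 12.0), ("the", 12.4), ("kids", 12.6)]
audio = {12.0: "<clip 12-13s>", 13.0: "<clip 13-14s>"}
print(replay_selected_word(2, words, audio))  # replays the clip containing "kids"
```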

In some cases a "Resume" or other icon 1210 may be presented proximate the selected word that can be selected via touch to continue the HU voice broadcast and text presentation at the location where the system left off when the CA selected the word for re-broadcast. In other cases, a short time (e.g., 1/4 second to 3 seconds) after rebroadcasting a selected word or phrase, the system may automatically revert back to the voice and text broadcast at the location where the system left off when the CA selected the word for re-broadcast.

While not shown, in some cases when a text word is selected, the system will also identify other possible words that may correspond to the voice segment associated with the selected word (e.g., second and third best options for transcription of the HU voice segment associated with the selected word) and those options may be automatically presented for touch selection and replacement via a list of touch selectable icons, one for each option, similar to Resume icon 1210. Here, the options may be presented in a list where the first list entry is the most likely substitute text option, the second entry is the second most likely substitute text option, and so on.

Referring again to FIG. 36, in other cases when a text word is selected on a CA display screen 50, a relay server or the CA's computer may select an HU voice segment that includes the selected word and also other words in an HU voice segment or phrase that includes the selected word for re-broadcast to the CA so that the CA has some audible context in which to consider the selected word. Here, when the phrase length segment is re-broadcast, the full text phrase associated therewith may be highlighted as shown at 1206 in FIG. 36. In some cases, the selected word may be highlighted or otherwise visually distinguished in one way and the phrase length segment that includes the selected word may be highlighted or otherwise visually distinguished in a second way that is discernably different to the CA so that the CA is not confused as to what was selected (e.g., see different highlighting at 1204 and 1206 in FIG. 36).

In some cases a single touch on a word may cause the CA computer to re-broadcast the single selected word while highlighting the selected word and the associated longer phrase that includes the selected word differently while a double tap on a word may cause the phrase that includes the selected word to be re-broadcast to provide audio context. Where the system divides up an HU voice signal by natural phrases, broadcasting a full phrase that includes a selected word should be particularly useful as the natural language phrase should be associated with a more meaningful context than an arbitrary group of words surrounding the selected word.

Upon selection of Resume icon 1210, the highlighting is removed from the selected word and the CA computer restarts simultaneously broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. In some cases, the CA computer may back up a few seconds from the point where the computer left off to restart the broadcast to re-contextualize the voice and text presented to the CA as the CA again begins correcting text errors.

In other cases, instead of requiring a user to select a "Resume" option, the system may, after a short period (e.g., one second after the selected word or associated phrase is re-broadcast), simply revert back to broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. Here, a beep or other audibly distinguishable signal may be generated upon word selection and at the end of a re-broadcast to audibly distinguish the re-broadcast from broadcast HU voice. In other cases any re-broadcast voice signal may be audibly modified in some fashion (e.g., higher pitch or tone, greater volume, etc.) to audibly distinguish the re-broadcast from other HU voice signal broadcast.

Referring now to FIG. 37, a screen shot akin to the screen shot shown in FIG. 26 that may be presented to an AU via an AU device display is illustrated at 50, albeit where an AU has selected a word from within transcribed text for re-broadcast. In at least some embodiments, similar to the CA system described above, when an AU selects a word from presented text, the instantaneous HU voice broadcast and text presentation is halted, the selected word is highlighted or otherwise visually distinguished as shown at 1230 and the phrase including the selected word may also be differently visually distinguished. Beeps or other audible signals may be generated immediately prior to and after re-broadcast of a voice signal segment. When a word is selected, the AU device speaker (e.g., the speaker in associated handset 22) re-broadcasts the HU voice signal that is associated through the assigned time stamp to the selected word. In other cases the AU device will re-broadcast the entire phrase or sub-phrase that includes the selected word to give audio context to the selected word.

While the time stamping concept is described above with respect to a system where an ASR initially transcribes an HU voice signal to text and a CA corrects the ASR generated text, the time stamping concept is also advantageously applicable to cases where a CA transcribes an HU voice signal to text and then corrects the transcribed text or where a second CA corrects text transcribed by a first CA. To this end, in at least some cases it is contemplated that an ASR may operate in the background of a CA transcription system to generate and time stamp ASR text (e.g., text generated by an ASR engine) in parallel with the CA generated text. A processor may be programmed to compare the ASR text and CA generated text to identify at least some matching words or phrases and to assign the time stamps associated with the matching ASR generated words or phrases to the matching CA generated text.

It is recognized that the CA text will likely be more accurate than the ASR text most of the time and therefore that there will be differences between the two text strings. However, some if not most of the time the ASR and CA generated texts will match so that many of the time stamps associated with the ASR text can be directly applied to the CA generated text to align the HU voice signal segments with the CA generated text. In some cases it is contemplated that confidence factors may be generated for likely associated ASR and CA generated text and time stamps may only be assigned to CA generated text when a confidence factor is greater than some threshold confidence factor value (e.g., 88/100). In most cases it is expected that confidence factors that exceed the threshold value will occur routinely and with short intervening durations so that a suitable number of reliable time stamps can be generated.

Once time stamps are associated with CA generated text, the stamps may be used to precisely align HU voice signal broadcast and text presentation to an AU or a CA (e.g., in the case of a second "correcting CA") as described above as well as to support re-broadcast of HU voice signal segments corresponding to selected text by a CA and/or an AU.

A sub-process 1300 that may be substituted for a portion of the FIG. 32 process is shown in FIG. 38, albeit where ASR generated time stamps are applied to CA generated text. Referring also to FIG. 32, steps 1302 through 1310 shown in FIG. 38 are swapped into the FIG. 32 process for steps 1112 through 1118. After an ASR engine generates and stores time stamped text segments for a received HU voice signal segment, control passes to block 1302 in FIG. 38 where the relay broadcasts the HU voice signal to a CA and the CA revoices the HU voice signal to transcription software trained to the CA's voice and the software yields CA generated text.

At block 1304, a relay server or processor compares the ASR text to the CA generated text to identify high confidence "matching" words and/or phrases. Here, the phrase high confidence means that there is a high likelihood (e.g., 95% likely) that an ASR text word or phrase and a CA generated text word or phrase both correspond to the exact same HU voice signal segment. Characteristics analyzed by the comparing processor include multiple word identical or nearly identical strings in compared text, when text appears temporally in each text string relative to other assigned time stamps, easily transcribed words where both an ASR and a CA are highly likely to accurately transcribe words, etc. In some cases time stamps associated with the ASR text are only assigned to the CA generated text when the confidence factor related to the comparison is above some threshold level (e.g., 88/100). Time stamps are assigned at block 1306 in FIG. 38.
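
The comparison and stamp transfer described above might be sketched as follows; the toy position based confidence model, the use of the 0.88 threshold and all names are assumptions for illustration, and a real implementation would use a proper sequence alignment and a richer confidence model.

```python
# Minimal sketch of transferring ASR time stamps to matching CA generated
# words; exact-match plus a crude position-distance confidence stands in for
# the high confidence comparison described above.
from typing import List, Optional, Tuple


def assign_stamps(asr_words: List[Tuple[str, float]],
                  ca_words: List[str],
                  confidence_threshold: float = 0.88) -> List[Tuple[str, Optional[float]]]:
    """asr_words: (word, stamp) from the ASR engine; ca_words: CA text words.
    A CA word inherits an ASR stamp only when a crude confidence score for
    the match exceeds confidence_threshold."""
    stamped = []
    for i, word in enumerate(ca_words):
        best_stamp, best_conf = None, 0.0
        for j, (a_word, stamp) in enumerate(asr_words):
            if a_word.lower() == word.lower():
                # Confidence decays as the word positions diverge (toy model).
                conf = max(0.0, 1.0 - 0.1 * abs(i - j))
                if conf > best_conf:
                    best_stamp, best_conf = stamp, conf
        stamped.append((word, best_stamp if best_conf >= confidence_threshold else None))
    return stamped


asr = [("please", 0.0), ("bring", 0.4), ("the", 0.7), ("cods", 0.8), ("home", 1.1)]
ca = ["please", "bring", "the", "kids", "home"]
print(assign_stamps(asr, ca))  # "kids" gets no stamp; the matching words inherit ASR stamps
```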

At block 1308, the relay presents the CA generated text to the CA for correction and at block 1310 the relay transmits the time stamped CA generated text segments to the AU device. After block 1310 control passes back to block 1120 in FIG. 32 where the AU device correlates time stamped CA generated text with HU voice signal segments previously stored in the AU device memory and stores the time stamps, text and associated voice segments. At block 1122, the AU device simultaneously broadcasts and presents identically time stamped HU voice and CA generated text to an AU. Again, in some cases, the AU device may have already broadcast the HU voice signal to the AU prior to block 1122. In this case, upon receiving the text, the text may be immediately presented via the AU device display to the AU for consideration. Here, the time stamped HU voice signal and associated text would only be used by the AU device to support synchronized HU voice and text re-play or re-presentation.

In some cases the time stamps assigned to a series of text and voice segments may simply represent relative time stamps as opposed to actual time stamps. For instance, instead of labelling three consecutive HU voice segments with actual times 3:55:45 AM, 3:55:48 AM and 3:55:51 AM, the three segments may be labelled t0, t1, t2, etc., where the labels are repeated after they reach some maximum number (e.g., t20). In this case, for instance, during a 20 second HU voice signal, the 20 second signal may have five consecutive labels t0, t1, t2, t3 and t4 assigned, one every four seconds, to divide the signal into five consecutive segments. The relative time labels can be assigned to HU voice signal segments and also associated with specific transcribed text segments.

In at least some cases it is contemplated that the rate of time stamp assignment to an HU voice signal may be dynamic. For instance, if an HU is routinely silent for long periods between intermittent statements, time stamps may only be assigned during periods while the HU is speaking. As another instance, if an HU speaks slowly at times and more rapidly at other times, the number of time stamps assigned to the user's voice signal may increase (e.g., when speech is rapid) and decrease (e.g., when speech is relatively slow) with the rate of user speech. Other factors may affect the rate of time stamps applied to an HU voice signal.

While the systems described above are described as ones where time stamps are assigned to an HU voice signal by either or both of an assisted user's device and a relay, in other cases it is contemplated that other system devices or processors may assign time stamps to the HU voice signal including a fourth party ASR engine provider (e.g., IBM's Watson, Google Voice, etc.). In still other cases where the HU device is a computer (e.g., a smart phone, a tablet type computing device, a laptop computer), the HU device may assign time stamps to the HU voice signal and transmit them to other system devices that need time stamps. All combinations of system devices assigning new or redundant time stamps to HU voice signals are contemplated.

In any case where time stamps are assigned to voice signals and text segments, words, phrases, etc., the engine(s) assigning the time stamps may generate stamps indicating any of (1) when a word or phrase is voiced in an HU voice signal audio stream (e.g., 16:22 to 16:22:5 corresponds to the word "Now") and (2) the time at which text is generated by the ASR for a specific word (e.g., "Now" generated at 16:25). Where a CA generates text or corrects text, a processor related to the relay may also generate time stamps indicating when a CA generated word is generated as well as when a correction is generated.

In at least some embodiments it is contemplated that any time a CA falls behind when transcribing an HU voice signal or when correcting an ASR engine generated text stream, the speed of the HU voice signal broadcast may be automatically increased or sped up as one way to help the CA catch up to a current point in an HU-AU call. For instance, in a simple case, any time a CA caption delay (e.g., the delay between an HU voice utterance and CA generation of text or correction of text associated with the utterance) exceeds some threshold (e.g., 12 seconds), the CA interface may automatically double the rate of HU signal broadcast to the CA until the CA catches up with the call.

In at least some cases the rate of broadcast may be dynamic between a nominal value representing the natural speaking speed of the HU and a maximum rate (e.g., three times the natural HU voice speed), and the instantaneous rate may be a function of the degree of captioning delay. Thus, for instance, where the captioning delay is 4 or fewer seconds, the broadcast rate may be 1 representing the natural speaking speed of the HU, if the delay is between 4 and 8 seconds the broadcast rate may be 2 (e.g., twice the natural speaking speed), and if the delay is greater than 8 seconds, the broadcast rate may be 3 (e.g., three times the natural speaking speed).
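
The example rates above map directly to a small lookup function; the following sketch simply restates those thresholds in code, with the function name assumed for illustration.

```python
# Sketch mapping CA captioning delay to an HU voice broadcast rate multiplier,
# using the example thresholds described above.
def broadcast_rate(delay_sec: float) -> float:
    """Return the playback-speed multiplier for the HU voice broadcast to a
    CA as a function of how far the CA has fallen behind the live call."""
    if delay_sec <= 4.0:
        return 1.0   # natural speaking speed
    if delay_sec <= 8.0:
        return 2.0   # twice natural speed to start catching up
    return 3.0       # maximum catch-up rate


for delay in (2.0, 6.0, 12.0):
    print(delay, "->", broadcast_rate(delay))
```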

In other cases the dynamic rate may be a function of other factors such as but not limited to the rate at which an HU utters words, perceived clarity in the connection between the HU and AU devices or between the AU device and the relay or between any two components within the system, the number of corrections required by a CA during some sub-call period (e.g., the most recent 30 seconds), statistics related to how accurately a CA can generate text or make text corrections at different speaking rates, some type of set AU preference, some type of HU preference, etc.

In some cases the rate of HU voice broadcast may be based on ASR confidence factors. For instance, where an ASR assigns a high confidence factor to a 15 second portion of HU voice signal and a low confidence factor to the next 10 seconds of the HU voice signal, the HU voice broadcast rate may be set to twice the rate of HU speaking speed during the first 15 second period and then be slowed down to the actual HU speaking speed during the next 10 second period.

In some cases the HU broadcast rate may be at least in part based on characteristics of an HU's utterances. For instance, where an HU's volume on a specific word is substantially increased or decreased, the word (or phrase including the word) may always be presented at the HU speaking speed (e.g., at the rate uttered by the HU). In other cases, where the volume of one word within a phrase is stressed, the entire phrase may be broadcast at speaking speed so that the full effect of the stressed word can be appreciated. As another instance, where an HU draws out pronunciation of a word such as "Well . . . " for 3 seconds, the word (or phrase including the word) may be presented at the spoken rate.

In some cases the HU voice broadcast rate may be at least in part based on words spoken by an HU or on content expressed in an HU's spoken words. For instance, simple words that are typically easy to understand including "Yes", "No", etc., may be broadcast at a higher rate than complex words like medical diagnoses, multi-syllable terms, etc.

In cases where the system generates text corresponding to both HU and AU voice signals, in at least some embodiments it is contemplated that during normal operation only text associated with the HU signal may be presented to an AU and that the AU text may only be presented to the AU if the AU goes back in the text record to review the text associated with a prior part of a conversation. For instance, if an AU scrolls back in a conversation 3 minutes to review prior discussion, ASR generated AU voice related text may be presented at that time along with the HU text to provide context for the AU viewing the prior conversation.

In the systems described above, whenever a CA is involved in a caption assisted call, the CA considers an entire HU voice signal and either generates a complete CA generated text transcription of that signal or corrects ASR generated text errors while considering the entire HU voice signal. In other embodiments it is contemplated that where an ASR engine generates confidence factors, the system may only present sub-portions of an HU voice signal to a CA that are associated with relatively low confidence factors for consideration to speed up the error correction process. Here, for instance, where ASR engine confidence factors are high (e.g., above some high factor threshold) for a 20 second portion of an HU voice signal and then are low for the next 10 seconds, a CA may only be presented the ASR generated text and the HU voice signal may not be broadcast to the CA during the first 20 seconds while substantially simultaneous HU voice and text are presented to the CA during the following 10 second period so that the CA is able to correct any errors in the low confidence text. In this example, it is contemplated that the CA would still have the opportunity to select an interface option to hear the HU voice signal corresponding to the first 20 second period or some portion of that period if desired.
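
A sketch of routing only low confidence segments to a CA for combined voice and text review, while passing high confidence segments through as text only, appears below; the threshold value, the segment tuples and all names are assumptions for the example.

```python
# Sketch of routing ASR output by confidence factor: high confidence segments
# are presented as text only (no HU voice broadcast), low confidence segments
# are broadcast with voice so the CA can correct them.
from typing import Iterator, List, Optional, Tuple

HIGH_CF_THRESHOLD = 0.90  # illustrative threshold


def route_segments(segments: List[Tuple[str, str, float]]) -> Iterator[Tuple[Optional[str], str]]:
    """segments: (audio, text, confidence). Yields (audio_to_broadcast, text),
    where audio is None for high confidence segments that skip CA review."""
    for audio, text, confidence in segments:
        if confidence >= HIGH_CF_THRESHOLD:
            yield None, text      # present text only; no voice broadcast
        else:
            yield audio, text     # broadcast voice with text for correction


calls = [("<clip A>", "I saw the doctor on Tuesday", 0.97),
         ("<clip B>", "about my hip replacement", 0.62)]
for audio, text in route_segments(calls):
    print("broadcast:", audio, "| text:", text)
```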

In some cases only a portion of HU voice signal corresponding to low confidence ASR engine text may be presented at all times and in other cases, this technique of skipping broadcast of HU voice associated with high confidence text may only be used by the system during threshold catch up periods of operation. For instance, the technique of skipping broadcast of HU voice associated with high confidence text may only kick in when a CA text correction process is delayed from an HU voice signal by 20 or more seconds.

In particularly advantageous cases, low confidence text and associated voice may be presented to a CA at normal speaking speed and high confidence text and associated voice may be presented to a CA at an expedited speed (e.g., 3 times normal speaking speed) when a text presentation delay (e.g., the period between the time an HU uttered a word and the time when a text representation of the word is presented to the CA) is less than a maximum latency period, and if the delay exceeds the maximum latency period, high confidence text may be presented in block form (e.g., as opposed to rapid sequential presentation of separate words) without broadcasting the HU voice to expedite the catch up process.

In cases where a system processor or server determines when to automatically switch or when to suggest a switch from a CA captioning system to an ASR engine captioning system, several factors may be considered including the following:

1. Percent match between ASR generated words and CA generated words over some prior captioning period (e.g., the last 30 seconds);
2. How accurately ASR confidence factors reflect corrections made by a CA;
3. Words per minute spoken by an HU and how that rate affects accuracy;
4. Average delay between ASR and CA generated text over some prior captioning period;
5. An expressed AU preference stored in an AU preferences database accessible by a system processor;
6. A current AU preference as set during an ongoing call via an on screen or other interface tool; and
7. Clarity of the received signal or some other proxy for line quality of the link between any two processors or servers within the system.

Other factors are contemplated.
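
One purely illustrative way to combine factors of this kind into a switch suggestion is sketched below; the weights, the 0 to 1 scaling and every name in the fragment are assumptions made for the example and are not taken from the disclosure.

```python
# Illustrative sketch of folding several of the listed factors into a single
# score used to suggest a switch from CA captioning to ASR captioning.
from dataclasses import dataclass


@dataclass
class SwitchFactors:
    asr_ca_word_match: float        # factor 1, scaled 0..1
    cf_correction_agreement: float  # factor 2, scaled 0..1
    speaking_rate_penalty: float    # factor 3, 0..1 (higher = harder audio)
    relative_latency_gain: float    # factor 4, 0..1 (ASR faster -> closer to 1)
    au_prefers_asr: bool            # factors 5 and 6 (preference gate)


def suggest_asr_switch(f: SwitchFactors, threshold: float = 0.75) -> bool:
    """AU preference gates the decision; the remaining factors are blended."""
    if not f.au_prefers_asr:
        return False
    score = (0.4 * f.asr_ca_word_match
             + 0.2 * f.cf_correction_agreement
             + 0.2 * f.relative_latency_gain
             + 0.2 * (1.0 - f.speaking_rate_penalty))
    return score >= threshold


print(suggest_asr_switch(SwitchFactors(0.93, 0.8, 0.2, 0.7, True)))  # True
```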

In at least some cases a speech recognition engine will generate a sequence of captions for a single word or phrase uttered by a speaker. For instance, where an HU speaks a word, an ASR engine may generate a first "estimate" of a text representation of the word based simply on the sound of the individual word and nothing more. Shortly thereafter (e.g., within 1 to 6 seconds), the ASR engine may consider words that surround (e.g., come before and after) the uttered word along with a set of possible text representations of the word to identify a final estimate of a text representation of the uttered word based on context derived from the surrounding words. Similarly, in the case of a CA revoicing an HU voice signal to an ASR engine trained to the CA voice to generate text, multiple iterations of text estimates may occur sequentially until a final text representation is generated.

In at least some cases it is contemplated that every best estimate of a text representation of every word to be transcribed will be transmitted immediately upon generation to an AU device for continually updated presentation to the AU so that the AU has the best HU voice signal transcription that exists at any given time. For instance, in a case where an ASR engine generates at least one intermediate text estimate and a final text representation of a word uttered by an HU and where a CA corrects the final text representation, each of the interim text estimate, the final text representation and the CA corrected text may be presented to the AU where updates to the text are made as in line corrections thereto (e.g., by replacing erroneous text with corrected text directly within the text stream presented) or, in the alternative, corrected text may be presented above or in some spatially associated location with respect to erroneous text.

In cases where an ASR engine generates intermediate and final text representations while a CA is also charged with correcting text errors, if the ASR engine is left to continually make context dependent corrections to text representations, there is the possibility that the ASR engine could change CA generated text and thereby undo an intended and necessary CA correction.

To eliminate the possibility of an ASR modifying CA corrected text, in at least some cases it is contemplated that automatic ASR engine contextual corrections for CA corrected text may be disabled. In this case, for instance, when a CA initiates a text correction or completes a correction in text presented on her device display screen, the ASR engine may be programmed to assume that the CA corrected text is accurate from that point forward. In some cases, the ASR engine may be programmed to assume that a CA corrected word is a true transcription of the uttered word which can then be used as true context for ascertaining the text to be associated with other ASR engine generated text words surrounding the true transcription. In some cases text words prior to and following the CA corrected word may be corrected by the ASR engine based on the CA corrected word that provides new context. Hereinafter, unless indicated otherwise, when an ASR engine is disabled from modifying a word in a text phrase, the word will be said to be "firm".
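
A minimal sketch of the "firm" word behavior is shown below, using an example similar to the FIG. 43 correction discussed below; the data model and function names are assumptions for illustration.

```python
# Sketch of marking CA corrected words "firm" so that a later ASR contextual
# re-correction pass cannot overwrite them.
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionWord:
    text: str
    firm: bool = False  # True once a CA has corrected/confirmed the word


def ca_correct(words: List[CaptionWord], index: int, new_text: str) -> None:
    words[index].text = new_text
    words[index].firm = True


def asr_recorrect(words: List[CaptionWord], index: int, new_text: str) -> None:
    """Apply an ASR contextual correction only if the word is not firm."""
    if not words[index].firm:
        words[index].text = new_text


line = [CaptionWord(w) for w in "please bing the cods home".split()]
ca_correct(line, 3, "kids")        # CA fixes "cods" -> "kids"; word becomes firm
asr_recorrect(line, 1, "bring")    # ASR contextual fix still allowed here
asr_recorrect(line, 3, "cards")    # ignored; "kids" is firm
print(" ".join(w.text for w in line))  # please bring the kids home
```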

In at least some cases it is contemplated that if a CA corrects a word or words at one location in presented text, if an ASR subsequently contextually corrects a word or phrase that precedes the CA corrected word or words, the subsequent ASR correction may be highlighted or otherwise visually distinguished so that the CA's attention is called thereto to consider the ASR correction. In at least some cases, when an ASR corrects text prior to a CA text correction, the text that was corrected may be presented in a hovering tag proximate the ASR correction and may be touch selectable by the CA to revert back to the pre-correction text if the CA so chooses. To this end, see the CA interface screen shot 1391 shown in FIG. 43 where ASR generated text is shown at 1393 that is similar to the text presented in FIG. 39, albeit with a few corrections. More specifically, in FIG. 43, it is assumed that a CA corrected the word "cods" to "kids" at 1395 (compare again to FIG. 39) after which an ASR engine corrected the prior word "bing" to "bring". The prior ASR corrected word is highlighted or distinguished as shown at 1397 and the word that was changed to make the correction is presented in hovering tag 1399. Tag 1399 is touch selectable by the CA to revert back to the prior word if selected.

In other cases where a CA initiates or completes a word correction, the ASR engine may be programmed to disable generating additional estimates or hypotheses for any words uttered by the HU prior to the CA corrected word or within a text segment or phrase that includes the corrected word. Thus, for instance, in some cases, where 30 text words appear on a CA's display screen, if the CA corrects the fifth most recently presented word, that corrected word and the 25 preceding words would be rendered firm and unchangeable via the ASR engine. Here, in some cases the CA would still be free to change any word presented on her display screen at any time. In other cases, once a CA corrects a word, that word and any preceding text words may be firm as to both the CA and the ASR engine.

In some cases there may be restrictions on text corrections that may be made by a CA. For instance, in a simple case where an AU device can only present a maximum of 50 words to an AU at a time, the system may only allow a CA to correct text corresponding to the 50 words most recently uttered by an HU. Here, the idea is that in most cases it will make no sense for a CA to waste time correcting text errors in text prior to the most recently uttered 50 words as an AU will only rarely care to back up in the record to see prior generated and corrected text. Here, the window of text that is correctable may be a function of several factors including font type and size selected by an AU on her device, the type and size of display included in an AU's device, etc. This feature of restricting CA corrections to AU viewable text is effectively a limit on how far behind CA error corrections can lag.

In some cases it is contemplated that a call may start out with full CA error correction so that the CA considers all ASR engine generated text but that, once the error correction latency exceeds some threshold level, the CA may only be able to or may be encouraged to only correct low confidence text. For instance, the latency limit may be 10 seconds at which point all ASR text is presented but low confidence text is visually distinguished in some fashion designed to encourage correction. To this end see for instance FIG. 40 where low and high confidence text is presented in different scrolling columns. In some cases error correction may be limited to the left column low confidence text as illustrated. FIG. 40 is described in more detail hereafter. Where only low confidence text can be corrected, in at least some cases the HU voice signal for the high confidence text may not be broadcast.

In some cases, only low confidence factor text and the associated HU voice signal may be presented and broadcast to a CA for consideration with some indication of missing text and voice between the presented text words or phrases. For instance, turn piping representations (see again 216 in FIG. 17) may be presented to a CA between low confidence editable text phrases.

In other cases, while interim and final ASR engine text may be presented to an AU, a CA may only see final ASR engine text and therefore only be able to edit that text. Here, the idea is that most of the time ASR engine corrections will be accurate and therefore, by delaying CA viewing until final ASR engine text is generated, the number of required CA corrections will be reduced appreciably. It is expected that this solution will become more advantageous as ASR engine speed increases so that there is minimal delay between interim and final ASR engine text representations.

In still other cases it is contemplated that only final ASR engine text may be sent on to an AU for consideration. In this case, for instance, ASR generated text may be transmitted to an AU device in blocks where context afforded by surrounding words has already been used to refine text hypotheses. For instance, words may be sent in five word text blocks where the block sent always includes the 6th through 10th most recently transcribed words so that the most recent through fifth most recent words can be used contextually to generate final text hypotheses for the 6th through 10th most recent words. Here, CA text corrections would still be made at a relay and transmitted to the AU device for in line corrections of the ASR engine final text.

In this case, if a CA takes over the task of text generation from an ASR engine for some reason (e.g., an AU requests CA help), the system may switch over to transmitting CA generated text word by word as the text is generated. In this case CA corrections would again be transmitted separately to the AU device for in line correction. Here, the idea is that the CA generated text should be relatively more accurate than the ASR engine generated text and therefore immediate transmission of the CA generated text to the AU would result in presentation of fewer errors to the AU.

While not shown, in at least some embodiments it is contemplated that turn piping type indications may be presented to a CA on her interface display as a representation of the delay between the CA text generation or correction and the ASR engine generated text. To this end, see the exemplary turn piping 216 in FIG. 17. A similar representation may be presented to a CA.

Where CA corrections or even CA generated text is substantially delayed, in at least some cases the system may automatically force a split to cause an ASR engine to catch up to a current time in a call and to firm up text before the split time. In addition, the system may identify a preferred split prior to which ASR engine confidence factors are high. For instance, where ASR engine text confidence factors for spoken words prior to the most recent 15 words are high and for the last fifteen words are low, the system may automatically suggest a split at the 15th most recent word so that ASR text prior to that word is firmed up and text thereafter is still presented to the CA to be considered and corrected. Here, the CA may reject the split either by selecting a rejection option or by ignoring the suggestion, or may accept the suggestion by selecting an accept option or by ignoring the suggestion (e.g., where the split is automatic if not rejected in some period (e.g., 2 seconds)). To this end, see the exemplary CA screen shot in FIG. 39 where ASR generated text is shown at 1332. In this case, the CA is behind in error correction so that the CA computer is currently broadcasting the word "want" as indicated by the "Broadcast" tag 1334 that moves along the ASR generated text string to indicate to the CA where the current broadcast point is located within the overall string. A "High CF—Catch Up" tag 1338 is provided to indicate a point within the overall ASR text string presented prior to which ASR confidence factors are high and after which ASR confidence factors are relatively lower. Here, it is contemplated that a CA would be able to select tag 1338 to skip to the tagged point within the text. If a CA selects tag 1338, the broadcast may skip to the associated tagged point so that "Broadcast" tag 1334 would be immediately moved to the point tagged by tag 1338 where the HU voice broadcast would recommence. In other cases, selecting high confidence tag 1338 may cause accelerated broadcast of text between tags 1334 and 1338 to expedite catch up.

Referring to FIG. 40, another exemplary CA screen shot 1333 that may be presented to show low and high confidence text segments and to enable a CA to skip to low confidence text and the associated voice signal is illustrated. Screen shot 1333 divides text into two columns including a low confidence column 1335 and a high confidence column 1337. Low confidence column 1335 includes text segments that have ASR assigned confidence factors that are less than some threshold value while high confidence column 1337 includes text segments that have ASR assigned confidence factors that are greater than the threshold value. Column 1335 is presented on the left half of screen shot 1333 and column 1337 is presented on the right half of shot 1333. The two columns would scroll upward simultaneously as more text is generated. Again, a current broadcast tag 1339 is provided at a current broadcast point in the presented text. Also, a "High CF, Catch Up" tag 1341 is presented at the beginning of a low confidence text segment. Here, again, it is contemplated that a CA may select the high confidence tag 1341 to skip the broadcast forward to the associated point to expedite the error correction process. As shown, in at least some cases, if the CA does not skip ahead by selecting tag 1341, the HU voice broadcast may be at 2X or more the speaking speed so that catch up can be more rapid.

In at least some cases it is contemplated that when a call is received at an AU device or at a relay, a system processor may use the calling number (e.g., the number associated with the calling party or the calling party's device) to identify the least expensive good option for generating text for a specific call. For instance, for a specific first caller, a robust and reliable ASR engine voice model may already exist and therefore be useable to generate automated text without the need for CA involvement most of the time while no model may exist for a second caller that has not previously used the system. In this case, the system may automatically initiate captioning using the ASR engine and first caller voice model for first caller calls and may automatically initiate CA assisted captioning for second caller calls so that a voice model for the second caller can be developed for subsequent use. Where the received call is from an AU and is outgoing to an HU, a similar analysis of the target HU may cause the system to initiate ASR engine captioning or CA assisted captioning.

In some embodiments the identity of an AU (e.g., an AU's phone number or other communication address) may also be used to select which of two or more text generation options to use to at least initiate captioning. Thus, some AUs may routinely request CA assistance on all calls while others may prefer all calls to be initiated as ASR engine calls (e.g., for privacy purposes) where CA assistance is only needed upon request for relatively small sub-periods of some calls. Here, AU phone or address numbers may be used to assess the optimal captioning type.

In still other cases both a called and a calling number may be used to assess the optimal captioning type. Here, in some cases, an AU number or address may trump an HU number or address and the HU number or address may only be used to assess the caption type to use initially when the AU has no perceived or expressed preference.

Referring again to FIG. 39, it has been recognized that, in addition to text corresponding to an HU voice signal, an optimal AU interface needs additional information that is related to specific locations within a presented text string. For instance, specific virtual control buttons need to be associated with specific text string locations. For example, see the "High CF—Catch Up" button in FIG. 39. As other examples, a "resume" tag as in FIG. 36 or a correction word (see FIG. 20) may need to be linked to a specific text location. As another instance, in some cases a "broadcast" tag indicating the word currently being broadcast may have to be linked to a specific text location (see FIG. 39).

In at least some embodiments, a CA interface or even an AU interface will take a form where text lines are separated by at least one blank line that operates as an "additional information" field in which other text location linked information or content can be presented. To this end, see FIG. 39 where additional information fields are collectively labelled 1215. In other embodiments it is contemplated that the additional information fields may also be provided below associated text lines. In still other embodiments, other text fields may be presented as separate in line fields within the text strings (see 1217 in FIG. 40).

In many industries it has been recognized that if a tedious job can be gamified, employee performance can be increased appreciably as employees work through obstacles to better personal scores and, in some cases, to compete with each other. Here, in addition to increased personal performance, an employing entity can develop insights into best work practices that can be rolled out to other employees attempting to better their performance. In the present case, various systems are being designed to add gamification aspects to the text captioning process performed by CAs. In this regard, in some cases it has been recognized that if a CA simply operates in parallel with an ASR engine to generate text, the CA may be tempted to simply let the ASR engine generate text without diligent error correction.

To avoid CAs shirking their error correction responsibilities, in at least some embodiments it is contemplated that a system processor that drives or is associated with a CA interface may introduce periodic and random known errors into ASR generated text that is presented to a CA as test errors. Here, the idea is that a CA should identify the test errors and at least attempt to make corrections thereto. In most cases, while errors would be introduced to the CA, the errors would not be presented to an AU and instead the correct ASR engine text would be presented to the AU. In some cases the system would allow a CA to actually correct the erroneous text without knowing which errors were ASR generated and which were introduced. In other cases, when a CA selects an introduced text error to make a correction, the interface may automatically make the correction upon selection so that the CA does not waste additional time rendering a correction. In some cases, when an introduced error is corrected either by the interface or the CA, a message may be presented to the CA indicating that the error was a purposefully introduced error.

Referring to FIG. 41, a method 1350 that is consistent with at least some aspects of the present disclosure for introducing errors into an ASR text stream for testing CA alertness is illustrated. At block 1352, an ASR engine generates ASR text segments corresponding to an HU voice signal. At block 1354, a relay processor or ASR engine assigns confidence factors to the ASR text and at block 1356, the relay identifies at least one high confidence text segment as a "test" segment. At block 1358, the processor transmits the high confidence test segment to an AU device for display to an AU. At block 1360, the processor identifies an error segment to be swapped into the ASR generated text for the test segment to be presented to the CA. For instance, where a high confidence test segment includes the phrase "John came home on Friday", the processor may generate an exemplary error segment like "John camp home on Friday".

Referring still to FIG. 41, at block 1362, the processor presents text with the error segment to the CA as part of an ongoing text stream to consider for error correction. At decision block 1364, the processor monitors for CA selection of words or phrases in the error segment to be corrected. Where the CA does not select the error segment for correction, control passes to block 1372 where the processor stores an indication that the error segment was not identified and control passes back up to block 1352 where the process continues to cycle. In addition, at block 1372, the processor may also store the test segment, the error segment and a voice clip corresponding to the test segment that may later be accessed by the CA or an administrator to confirm the missed error.

Referring again to block 1364 in FIG. 41, if the CA selects the error segment for correction, control passes to block 1366 where the processor automatically replaces the error segment with the test segment so that the CA does not have to correct the error segment. Here the test segment may be highlighted or otherwise visually distinguished so that the CA can see the correction made. In addition, in at least some cases, at block 1368, the processor provides confirmation that the error segment was purposefully introduced and corrected. To this end, see the "Introduced Error—Now Corrected" tag 1331 in FIG. 39 that may be presented after a CA selects an error segment. At block 1370, the processor stores an indication that the error segment was identified by the CA. Again, in some cases, the test segment, error segment and related voice clip may be stored to memorialize the error correction. After block 1370, control passes back up to block 1352 where the process continues to cycle.
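
The following sketch restates the FIG. 41 flow in code form under simplifying assumptions (in-memory bookkeeping, a single altered word as the manufactured error, and callables standing in for the CA interface); it is illustrative only.

    import random

    def make_error_segment(test_text):
        # Manufacture a test error by altering one word, e.g.
        # "John came home on Friday" -> "John camp home on Friday".
        words = test_text.split()
        i = random.randrange(len(words))
        words[i] = words[i][:-1] + "p" if len(words[i]) > 1 else "X"
        return " ".join(words)

    def run_alertness_test(test_text, present_to_ca, ca_selected_error, log):
        error_text = make_error_segment(test_text)   # block 1360
        present_to_ca(error_text)                    # block 1362; the AU still receives the correct text
        if ca_selected_error():                      # block 1364
            present_to_ca(test_text)                 # block 1366: auto-replace, no CA retyping needed
            log.append(("caught", test_text, error_text))    # block 1370
        else:
            log.append(("missed", test_text, error_text))    # block 1372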

In some cases errors may only be introduced when the rate of actual ASR engine errors and CA corrections is small. For instance, where a CA is routinely making error corrections during a one minute period, it would make no sense to introduce more text errors as the CA is most likely highly focused during that period. In addition, if a CA is substantially delayed in making corrections, the system may again opt to not introduce more errors.

Error introductions may include text additions, text deletions and text substitutions in some embodiments. In at least some cases the error generating processor or CA interface may randomly generate errors of any type and related to any ASR generated text. In other cases, the processor may be programmed to introduce meaningful errors calculated to change the meaning of a phrase so that a CA will be particularly motivated to correct the text error when presented. To this end, it has been recognized that some errors have limited effect on the meaning of an associated phrase while others can completely change the meaning of a phrase. Because ASR engines can understand context, they can also be programmed to ascertain when a simple text change will affect phrase meaning and can therefore be used to drive an interface as suggested here. For instance, in some cases introduced errors may only include meaningful errors. In other cases, introduced errors may include both meaningful errors and other errors that do not change the meaning of associated phrases and which would likely be recognized by an AU viewing the error, and different statistics may be collected and stored for each of the error types to develop metrics for judging CA effectiveness.

In some embodiments gamification can be enhanced by generating ongoing, real time dynamic scores for CA performance including, for instance, a score associated with accuracy, a separate score associated with captioning speed and/or separate speed and accuracy scores under different circumstances such as, for instance, for male and female voices, for east coast accents, Midwest accents, southern accents, etc., for high speed talking and slower speed talking, for captioning with correcting versus captioning alone versus correcting ASR engine text, and any combinations of factors that can be discerned. In FIG. 40, exemplary accuracy and speed scores that are updated in real time for an ongoing call are shown at 1343 and 1345, respectively. Where a call persists for a long time, a rolling most recent sub-period of the call may be used as the duration over which the scores are calculated.

CA scores may be stored as part of a CA profile and that profile could be routinely updated to reflect growing CA effectiveness with experience over time. Once CA specific scores are stored in a CA profile, the system may automatically route future calls that have characteristics that match high scores for a specific CA to that CA, which should increase overall system accuracy and speed. Thus, for instance, if an HU profile associated with a specific phone number indicates that an associated HU has a strong southern accent and speaks rapidly, when a call is received that is associated with that phone number, the system may automatically route the call to a CA that has a high gamification score for rapid southern accents if such a CA is available to take the call. In other cases it is contemplated that when a call is received at a relay where the call cannot be associated with an existing HU voice profile, the system may assign the call to a first CA to commence captioning where a relay processor analyzes the HU voice during the beginning of the call and identifies voice characteristics (e.g., rapid, southern, male, etc.) and automatically switches the call to a second CA that is associated with a high gamification score for the specific type of HU voice. In this case, speed and accuracy would be expected to increase after the switch to the second CA.
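
One way to implement the described routing is sketched below, where each CA profile is assumed to hold per-characteristic gamification scores; the trait keys and profile structure are assumptions for illustration.

    def route_call(voice_traits, available_cas):
        """voice_traits: e.g., ("southern", "rapid", "male").
        available_cas: list of CA profiles, each a dict with a "scores" mapping."""
        if not available_cas:
            return None
        def match_score(ca):
            return sum(ca["scores"].get(trait, 0.0) for trait in voice_traits)
        return max(available_cas, key=match_score)   # best-matching available CA takes the call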

In addition, in some cases it is contemplated that, beyond the individual speed and accuracy scores, a combined speed/accuracy score can be generated for each CA over the course of time, for each CA over a work period (e.g., a 6 hour captioning day), for each CA for each call that the CA handles, etc. For example, an exemplary single score algorithm may include a running tally that adds one point for a correct word and adds zero points for an incorrect word, where the correct word point is offset by an amount corresponding to a delay in word generation after some minimal threshold period (e.g., 2 seconds after the word is broadcast to the CA for transcription or one second after the word is broadcast to and presented to a CA for correction). For instance, the offset may be 0.2 points for every second after the minimal threshold period. Other algorithms are contemplated. The single score may be presented to a CA dynamically and in real time so that the CA is motivated to focus more. In other cases the single score per phone call may be presented at the end of each call or an average score over a work period may be presented at the end of the work period. In FIG. 40, an exemplary current combined score is shown at 1347.
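
The exemplary tally can be expressed compactly as follows; the 2 second threshold and 0.2 point-per-second offset are the values given above, while clamping the per-word score at zero is an added assumption.

    def word_score(correct, delay_seconds, threshold=2.0, offset_per_second=0.2):
        if not correct:
            return 0.0                                   # zero points for an incorrect word
        late = max(0.0, delay_seconds - threshold)       # delay beyond the minimal threshold period
        return max(0.0, 1.0 - offset_per_second * late)  # one point, offset by 0.2 per late second

    def running_score(words):
        """words: list of (correct: bool, delay_seconds: float) tuples for a call or sub-period."""
        return sum(word_score(c, d) for c, d in words)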

The single score or any of the contemplated metrics may also be related to other factors such as, for instance, how quickly errors are corrected by a CA, how many ASR errors need to be corrected in a rolling period of time, how many manufactured or purposefully introduced errors are caught and corrected, how the CA responds once the CA has fallen behind, how fast an HU is speaking (WPM), how clear a voice signal is received (perhaps as measured by the ASR engine), ASR confidence factors associated with text generated during a call (as a proxy for captioning complexity), etc.

In at least some of the embodiments described above an AU has the option to request CA assistance or more CA assistance than currently afforded on a call and/or to request ASR engine text as opposed to CA generated text (e.g., typically for privacy purposes). While a request to change the caption technique may be received from an AU, in at least some cases the alternative may not be suitable for some reason and, in those cases, the system may forego a switch to a requested technique and provide an indication to the requesting AU that the switch request has been rejected. For instance, if an AU receiving CA generated and corrected text requests a switch to an ASR engine but the accuracy of the ASR engine is below some minimal threshold, the system may present a message to the AU that the ASR engine cannot currently support captioning and the CA generation and correction may persist. In this example, once the ASR engine is ready to accurately generate text, the switch thereto may be either automatic or the system may present a query to the AU seeking authorization to switch over to the ASR engine for subsequent captioning.
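
A minimal sketch of such gating logic follows; the accuracy threshold and the message wording are assumptions rather than values taken from the disclosure.

    MIN_ASR_ACCURACY = 0.92   # assumed minimal threshold

    def handle_switch_request(requested_type, current_asr_accuracy, notify_au):
        if requested_type == "ASR" and current_asr_accuracy < MIN_ASR_ACCURACY:
            notify_au("The ASR engine cannot currently support captioning; "
                      "CA captioning will continue.")
            return "CA"                 # forego the switch for now
        return requested_type           # otherwise honor the AU request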

In a similar fashion, if an AU requests additional CA assistance, a system processor may determine that ASR engine text accuracy is low for some reason that will also affect CA assistance and may notify the AU that the switch will not be made, along with a reason (e.g., "Communication line fault").

In cases where privacy is particularly important to an AU on a specific call or generally, the caption system may automatically, upon request from an AU or per AU preferences stored in a database, initiate all captioning using an ASR engine. Here, where corrections are required, the system may present short portions of an HU's voice signal to a series of CAs so that each CA only considers a portion of the text for correction. Then, the system would stitch all of the CA corrected text together into an HU text stream to be transmitted to the AU device for display.
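
The stitching idea can be sketched as follows, assuming the HU text stream is already divided into ASR segments and that correct_portion stands in for a CA correction step; the portion length is an arbitrary assumption.

    def corrected_stream(asr_segments, ca_pool, correct_portion, portion_len=5):
        out = []
        for start in range(0, len(asr_segments), portion_len):
            portion = asr_segments[start:start + portion_len]
            ca = ca_pool[(start // portion_len) % len(ca_pool)]   # rotate through a series of CAs
            out.extend(correct_portion(ca, portion))              # each CA sees only its short portion
        return out   # corrected text, stitched in order, for transmission to the AU device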

In some cases it is contemplated that an AU device interface may present a split text screen to an AU so that the AU has the option to view essentially real time ASR generated text or CA corrected text when the corrected text substantially lags the ASR text. To this end, see the exemplary split screen interface 1450 in FIG. 45 where CA corrected text is shown in an upper field 1452 and "real time" ASR engine text is presented in a lower field 1454. As shown, a "CA location" tag 1456 is presented at the end of the CA corrected text while a "Broadcast" tag 1458 is presented at the end of the ASR engine text to indicate the CA and broadcast locations within the text string. Where CA correction latency reaches a threshold level (e.g., the text between the CA correction location and the most recent ASR text no longer fits on the display screen), text in the middle of the string may be replaced by a period indicator to indicate the duration of HU voice signal at the speaking speed that corresponds to the replaced text. Here, as the CA moves on through the text string, text in the upper field 1452 scrolls up and, as the HU continues to speak, the ASR text in the bottom field 1454 also scrolls up independent of the upper field scrolling rate.

In at least some cases it is contemplated that an HU may use a communication device that can provide video of the HU to an AU during a call. For instance, an HU device may include a portable tablet type computing device or smart phone (see 1219 in FIG. 33) that includes an integrated camera for telepresence type communication. In other cases, as shown in FIG. 33, a camera 1123 may be linked to the HU phone or other communication device 14 for collecting HU video when activated. Where HU video is obtained by an HU device, in most cases the video and voice signals will already be associated for synchronous playback. Here, as the HU voice and video signals are transmitted to an AU device, the HU video may be broken down into video segments that correspond with time stamped text and voice segments, and the stamped text, voice and video segments may be stored for simultaneous replay to the AU. Here, where there are delays between broadcast of consecutive HU voice segments as text transcription progresses, in at least some cases the HU video will freeze during each delay. Similarly, if the HU voice signal is sped up during a catch up period as described above, the HU video may be shown at a faster speed so that the voice and video broadcasts are temporally aligned.
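
Keeping the stamped text, voice and video segments together, as described above, might be represented by a structure like the following (a sketch only; the field types and playback callables are assumptions).

    from dataclasses import dataclass

    @dataclass
    class StampedSegment:
        time_stamp: float     # seconds from the start of the call
        text: str
        voice_clip: bytes
        video_clip: bytes

    def replay(segments, selected_index, play_voice, play_video):
        seg = segments[selected_index]
        play_voice(seg.voice_clip)    # voice and video replay start from the same stamp,
        play_video(seg.video_clip)    # so the two broadcasts remain temporally aligned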

FIG. 42 shows an exemplary AU device screen shot 1308 including transcribed text 1382 and a video window or field 1384. Here, assuming that all of the shown text at 1382 has already been broadcast to the AU, if the AU selects the phrase "you should bing the cods along" as indicated by hand icon 1386, the AU device would identify the voice segment and video segment associated with the selected text segment and replay both the voice and video segments while the phrase remains highlighted for the user to consider.

Referring yet again to FIG. 33, in some cases the AU device or AU station may also include a video camera 1125 for collecting AU video that can be presented to the HU during a call. Here, it is contemplated that at least some HUs may be reticent to allow an AU to view HU video without having the reciprocal ability to view the AU during an ongoing call and therefore reciprocal AU viewing would be desirable.

At least four advantages result from systems that present HU video to an AU during an ongoing call. First, where the video quality is relatively high, the AU will be able to see the HU's facial expressions which can increase the richness of the communication experience.

Second, in some cases the HU representation in a video may be useable to discern words intended by an HU even if a final text representation thereof is inaccurate. For instance, where a text transcription error occurs, an AU may be able to select the phrase including the error and view the HU video associated with the selected phrase while listening to the associated voice segment and, based on both the audio and video representations, discern the actual phrase spoken by the HU.

Third, it has been recognized that during most conversations, people instinctively provide visual cues to each other that help participants understand when to speak and when to remain silent while others are speaking. In effect, the visual cues operate to help people take turns during a conversation. By providing video representations to each of an HU and an AU during a call, both participants can have a good sense of when it is their turn to talk, when the other participant is struggling with something that was said, etc.

Fourth, for deaf AUs that are trained to read lips, the HU video may be useable by the AU to enhance communication.

In at least some cases an AU device may be programmed to query an HU device at the beginning of a communication to determine if the HU device has a video camera useable to generate an HU video signal. If the HU device has a camera, the AU device may cause the HU device to issue a query to the HU requesting access to and use of the HU device camera during the call. For instance, the query may include brief instructions and a touch selectable "Turn on camera" icon or the like for turning on the HU device camera. If the HU rejects the camera query, the system may operate without generating and presenting an HU video as described above. If the HU accepts the request, the HU device camera is turned on to obtain an HU video signal while the HU voice signal is obtained and the video and voice signals are transmitted to the AU device for further processing.
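
A rough sketch of that negotiation is shown below; the hu_device methods (query, prompt, start_camera) are purely hypothetical stand-ins for whatever signaling the devices actually use.

    def negotiate_hu_video(hu_device):
        if not hu_device.query("has_camera"):
            return None                         # no camera; proceed without HU video
        if hu_device.prompt("Turn on camera for this call?"):
            return hu_device.start_camera()     # HU video is sent along with the HU voice signal
        return None                             # HU rejected the camera query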

There are video relay systems on the market today where specially trained CAs provide a sign language service for deaf AUs. In these systems, while an HU and an AU are communicating via a communication link or network, an HU voice signal is provided to a CA. The CA listens to the HU voice signal and uses her hands to generate a sequence of signs that correspond at least roughly to the content (e.g., meaning) of the HU voice messages. A video camera at a CA station captures the CA sign sequence (e.g., the "sign signal") and transmits that signal to an AU device which presents the sign signal to the AU via a display screen. If the AU can speak, the AU talks into a microphone and the AU's voice is transmitted to the HU device where it is broadcast for the HU to hear.

In at least some cases it is contemplated that a second or even a third communication signal may be generated for the HU voice signal that can be transmitted to the AU device and presented along with the sign signal to provide additional benefit to the AU. For instance, it has been recognized that, while sign language can come close to the meaning expressed in an HU voice signal, in many cases there is no exact translation of a voice message to a sign sequence and therefore some meaning can get lost in the voice to sign signal translation. In these cases, it would be advantageous to present both a text translation and a sign translation to an AU.

In at least some cases it is contemplated that an ASR engine at a relay or operated by a fourth party server linked to a relay may, in parallel with a CA generating a sign signal, generate a text sequence for an HU voice signal. The ASR text signal may be transmitted to an AU device along with or in parallel with the sign signal and may be presented simultaneously as the text and sign signals are generated. In this way, if an AU questions the meaning of a sign signal, the AU can refer to the ASR generated text to confirm meaning or, in many cases, review an actual transcript of the HU voice signal as opposed to a sometimes less accurate sign language representation.

In many cases an ASR will be able to generate text far faster than a CA will be able to generate a sign signal and therefore, in at least some cases, ASR engine text may be presented to an AU well before a CA generated sign signal. In some cases where an AU views, reads and understands text segments well prior to generation and presentation of a sign signal related thereto, the AU may opt to skip ahead and forego sign language for the intervening HU voice signal. Where an AU skips ahead in this fashion, the CA would be skipped ahead within the HU voice signal as well and continue signing from the skipped-to point on.

In at least some cases it is contemplated that a relay or other system processor may be programmed to compare text signal and sign signal content (e.g., actual meaning ascribed to the signals) so that time stamps can be applied to text and sign segment pairings, thus enabling an AU to skip back through communications to review a sign signal simultaneously with a paired text tag or other indicator. For instance, in at least some embodiments as HU voice is converted by a CA to sign segments, a processor may be programmed to assess the content (e.g., meaning) of each sign segment. Similarly, the processor may also be programmed to analyze the ASR generated text for content and to then compare the sign segment content to the text segment content to identify matching content. Where sign and text segment content match, the processor may assign a time stamp to the content matching segments and store the stamp and segment pair for subsequent access. Here, if an AU selects a text segment from her AU device display, instead of (or in addition to in some embodiments) presenting an associated HU voice segment, the AU device may represent the sign segment paired with the selected text.
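
The pairing step might look roughly like the following, where content_of stands in for the meaning analysis described above and the segment attributes are assumptions for illustration.

    def pair_segments(sign_segments, text_segments, content_of, store):
        for sign in sign_segments:
            for text in text_segments:
                if content_of(sign) == content_of(text):        # matching meaning
                    store.append((sign.time_stamp, sign, text))  # shared stamp for later replay
                    break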

Referring again to FIG. 33, the exemplary CA station includes, among other components, a video camera 55 for taking video of a signing CA to be delivered along with transcribed text to an AU. Referring also and again to FIG. 42, a CA signing video window is shown at 1390 alongside a text field that includes text corresponding to an HU voice signal. In FIG. 42, if an AU selects the phrase labelled 1386, that phrase would be visually highlighted or distinguished in some fashion and the associated or paired sign signal segment would be represented in window 1390.

In at least some video relay systems, in addition to presenting sign and text representations of an HU voice signal, an HU video signal may also be used to represent the HU during a call. In this regard, see again FIG. 42 where both an HU video window 1384 and a CA signing window 1390 are presented simultaneously. Here, all communication representations 1382, 1384 and 1390 may always be synchronized via time stamps in some cases, while in other cases the representations may not be completely synchronized. For instance, in some cases the HU video window 1384 may always present a real time representation of the HU while the text and sign signals 1382 and 1390 are synchronized and typically delayed at least somewhat to compensate for the time required to generate the sign signal as well as AU replay of prior sign signal segments.

In still other embodiments it is contemplated that a relay or other system processor may be programmed to analyze sign signal segments generated by a signing CA to automatically generate text segments that correspond thereto. Here the text is generated from the sign signal as opposed to directly from the voice signal and therefore would match the sign signal content more closely in at least some embodiments. Because the text is generated directly from the sign signal, time stamps applied to the sign signal can easily be aligned with the text signal and there would be no need for content analysis to align signals. Instead of using content to align, a sign signal segment would be identified and a time stamp applied thereto, then the sign signal segment would be translated to text and the resulting text would be stored in the system database correlated to the corresponding sign signal segment and the time stamp for subsequent access.

FIG. 44 shows yet another exemplary AU screen shot 1400 where text segments are shown at 1402 and an HU video window is shown at 1412. The text 1402 includes a block of text comprising a set of text lines where the block is presented in three visually distinguished ways. First, a currently audibly broadcast word is highlighted or visually distinguished in a first way as indicated at 1406. Second, the line of text that includes the word currently being broadcast is visually distinguished in a second way as shown at 1404. Other text lines are presented above and below the line 1404 to show preceding text and following text for context. In addition, the line at 1404 including the currently broadcast word at 1406 is presented in a larger format to call an AU's attention to that line of text and the word being broadcast. The larger text makes it easier for an AU to see the presented text. Moreover, the text block 1402 is controlled to scroll upward while keeping the text line that includes the currently broadcast word generally centrally vertically located on the AU device display so that the AU can simply train her eyes on the central portion of the display with the transcribed words scrolling through the field 1404. In this case, a properly trained AU would know that prior broadcast words can be replayed by tapping a word above field 1404 and that the broadcast can be skipped ahead by tapping one of the words below field 1404. Video window 1412 is provided spatially close to field 1404 so that the text presented therein is intuitively associated with the HU video in window 1412.

In at least some embodiments it is contemplated that when a CA takes over text generation from an ASR engine for some reason and re-voices an HU voice signal to generate the text, instead of providing the CA's re-voiced signal to an ASR engine at the relay, the CA re-voicing signal may be routed to the ASR engine that was previously being used to convert the HU voice signal to text. Thus, for instance, where a system was transmitting an HU voice signal to a fourth party ASR engine provider when a CA takes over text generation via re-voicing, when the CA voices a word, the CA voice signal may be transmitted to the fourth party provider to generate transcribed text which is then transmitted back to the relay and on to the AU device for presentation.

To apprise the public of the scope of the present invention the following claims are made.

What is claimed is:
1. A system for presenting substantially simultaneous voice and text to an assisted user (AU) during a voice conversation between the AU and a hearing user (HU), the hearing user using an HU device to talk to the assisted user, the system comprising: an AU captioned device including a device processor; a relay that includes a relay display, a relay speaker and a relay processor; wherein, at least one of the device processor and the relay processor is programmed to perform the steps of: (i) receiving an HU voice signal comprising a sequence of HU voice segments; and (ii) assigning time stamps to each of the HU voice segments; wherein, the relay processor is programmed to perform the steps of: (i) generating text segments corresponding to each HU voice segment; (ii) storing each HU voice segment along with a corresponding text segment and a corresponding time stamp in a memory device; (iii) broadcasting the HU voice segments to a call assistant (CA) via the relay speaker; and (iv) presenting each text segment via the relay display substantially contemporaneously with broadcast of the corresponding HU voice segment.
2. The system of claim 1 wherein the relay processor assigns at least a subset of the time stamps to the HU voice segments.
3. The system of claim 1 wherein the AU device assigns at least a subset of the time stamps to the HU voice segments.
4. The system of claim 3 wherein the relay processor receives the HU voice signal from the AU device.
5. The system of claim 4 wherein the AU device transmits the HU voice segments and associated time stamps to the relay processor.
6. The system of claim 5 wherein the relay further transmits the text segments along with the time stamps to the AU device processor.
7. The system of claim 6 wherein the AU device further includes a device display and a device speaker, the device processor further programmed to perform the steps of broadcasting the HU voice segments to the AU via the device speaker and presenting each text segment via the device display substantially contemporaneously with broadcast of the corresponding HU voice segment via the device speaker.
8. The system of claim 1 wherein the step of generating text segments includes an automatic speech recognition (ASR) engine automatically generating the text segments from the HU voice segments.
9. The system of claim 8 wherein the relay links to an ASR engine provider server that generates the text segments and that assigns the time stamps to each text segment.
10. The system of claim 1 wherein the AU device processor assigns time stamps to each text segment and the relay assigns time stamps to each text segment.
11. The system of claim 1 wherein the relay further includes a user interface, the relay processor further programmed to monitor the interface for selection of at least a word within a text segment displayed on the relay display and, upon selection of at least a word within the displayed text, halting broadcast of the HU voice signal and rebroadcasting the at least a word via the relay speaker.
12. The system of claim 7 wherein the AU device further includes a user interface, the device processor further programmed to monitor the interface for selection of at least a word within a text segment displayed on the device display and, upon selection of at least a word within the displayed text, halting broadcast of the HU voice signal via the device speaker and rebroadcasting the at least a word via the device speaker.
13. A system for presenting substantially simultaneous voice and text to an assisted user (AU) during a voice conversation between the AU and a hearing user (HU), the hearing user using an HU device to talk to the assisted user, the system comprising: an AU captioned device including a device processor, a device display and a device speaker; a relay that includes a relay processor; wherein, at least one of the device processor and the relay processor is programmed to perform the steps of: (i) receiving an HU voice signal comprising a sequence of HU voice segments; and (ii) assigning time stamps to each of the HU voice segments; wherein, the relay processor is programmed to perform the steps of: (i) generating text segments corresponding to each HU voice segment; and (ii) transmitting the text segments and time stamps to the AU device; wherein, the device processor is programmed to perform the steps of: (i) receiving the text segments and time stamps; (ii) storing each HU voice segment along with a corresponding text segment and a corresponding time stamp in a memory device; (iii) broadcasting the HU voice segments to the AU via the device speaker; and (iv) presenting each text segment via the device display substantially contemporaneously with broadcast of the corresponding HU voice segment.
14. The system of claim 13 wherein the AU device further includes a user interface, the device processor further programmed to monitor the interface for selection of at least a word within a text segment displayed on the device display and, upon selection of at least a word within the displayed text, halting broadcast of the HU voice signal via the device speaker and rebroadcasting the at least a word via the device speaker.
15. The system of claim 13 wherein the relay transmits the text segments immediately upon generation and wherein the device processor presents the text segments immediately upon reception, the step of presenting each text segment commensurately including visually distinguishing the text segment corresponding to the currently broadcast HU voice segment.
16. The system of claim 13 wherein the relay feeds the HU voice signal to an automatic speech recognition (ASR) engine which generates the text segments, the relay further including a call assistant (CA) station that includes a relay display, a relay speaker and a relay user interface, a CA using the station to view the text segments on the relay display while listening to the HU voice signal, the device processor further programmed to monitor an AU device interface for selection of a text segment on the device display that is subsequent to the text segment corresponding to a currently broadcast text segment and, upon receiving the text segment selection, halting broadcast of the HU voice signal, identifying the HU voice segment associated with the selected text segment and broadcasting the identified HU voice segment via the device speaker.
17. The system of claim 16 wherein, upon receiving the text segment selection, transmitting a text segment selection signal to the relay processor, upon receiving the text segment selection signal, the relay processor halting broadcast of the HU voice signal, identifying the HU voice segment associated with the selected text segment and broadcasting the identified HU voice segment via the relay speaker.
18. A system for presenting substantially simultaneous voice and text to an assisted user (AU) during a voice conversation between the AU and a hearing user (HU), the hearing user using an HU device to talk to the assisted user, the system comprising: an AU captioned device including a device processor, a device display and a device speaker; a relay that includes a relay processor, a relay display and a relay speaker; wherein, each of the AU device and the relay performs the steps of: (i) receiving an HU voice signal comprising a sequence of HU voice segments; (ii) storing time stamps with each of the HU voice segments; and (iii) presenting the text segments via the displays; wherein the relay processor is further programmed to perform the steps of: (i) broadcasting each voice segment via the relay speaker substantially commensurate with presenting the corresponding text segment via the relay display; wherein the device processor is further programmed to perform the steps of: (i) broadcasting each voice segment via the device speaker substantially commensurate with presenting the corresponding text segment via the device display; (ii) monitoring for a signal from the AU to skip ahead in the voice segment broadcast; and (iii) upon receiving the signal to skip ahead, transmitting a skip ahead signal to the relay thereby causing the relay to automatically skip ahead in HU voice segment broadcast.